Near optimal decoding of good error control codes is generally a difficult task. However, for
a certain type of (sufficiently) good codes an efficient decoding algorithm with near optimal 
performance exists. These codes are defined via a combination of constituent codes with low 
complexity trellis representations. Their decoding algorithm is an instance of (loopy) belief 
propagation and is based on an iterative transfer of constituent beliefs. The beliefs are thereby 
given by the symbol probabilities computed in the constituent trellises. Even though weak constituent
codes are employed, close to optimal performance is obtained, i.e., the encoder/decoder
pair (almost) achieves the information theoretic capacity. However, (loopy) belief propagation 
only performs well for a rather specific set of codes, which limits its applicability. 
In this paper a generalisation of iterative decoding is presented. It is proposed to transfer 
more values than just the constituent beliefs. This is achieved by the transfer of beliefs obtained
by independently investigating parts of the code space. This leads to the concept of
discriminators, which are used to improve the decoder resolution within certain areas and
define discriminated symbol beliefs. It is shown that these beliefs approximate the overall
symbol probabilities. This leads to an iteration rule that (below channel capacity) typically 
only admits the solution of the overall decoding problem. Via a GAUSS approximation a low 
complexity version of this algorithm is derived. Moreover, the approach may then be applied 
to a wide range of channel maps without significant complexity increase. 

Keywords: Iterative Decoding, Coupled Codes, Information Theory, Complexity, Belief Propagation,
Typical Decoding, Set Representations, Central Limit Theorem, Equalisation, Estimation, Trellis Algorithms

Decoding error control codes is the inversion of the encoding map in the presence of errors. An
optimal decoder finds the codeword with the least number of errors. However, optimal decoding is
generally computationally infeasible due to the intrinsic non-linearity of the inversion operation. Up to now
only simple codes can be optimally decoded, e.g., by a simple trellis representation. These codes generally
exhibit poor performance or rate [11].

On the other hand, good codes can be constructed by a combination of simple constituent codes (see
e.g., [11, pp. 567ff]). This construction is interesting as then a trellis based inversion may perform almost
optimally: Berrou et al. [2] showed that iterative turbo decoding leads to near capacity performance. The
same holds true for iterative decoding of Low Density Parity Check (LDPC) codes [6]. Both decoders
are conceptually similar and based on the (loopy) propagation of beliefs (W] computed in the constituent 
trellises. However, (loopy) belief propagation is often limited to idealistic situations. E.g., turbo decoding 
generally performs poorly for multiple constituent codes, complex channels, good constituent codes, and/or 
relatively short overall code lengths. 

In this paper a concept called discrimination is used to generalise iterative decoding by (loopy) belief 
propagation. The generalisation is based on an uncertainty or distance discriminated investigation of the 
code space. The overall results of the approach are linked to basic principles in information theory such as
typical sets and channel capacity ifTSl [TSl [T3l . 

Overview: The paper is organised as follows: First the combination of codes together with the decoding 
problem and its relation to belief propagation are reviewed. Then the concept of discriminators together 
with the notion of a common belief is introduced. In the second section local discriminators are discussed. 
By a local discriminator a controllable number of parameters (or generalised beliefs) is transferred. It is
shown that this leads to a practically computable common belief that may be used in an iteration. Moreover, 
a fixed point of the obtained iteration is typically the optimal decoding decision. Section 3 finally considers 
a low complexity approximation and the application to more complex channel maps. 



1. Code Coupling 



To review the combination of constituent codes we here consider only binary linear codes $C$ given by the
encoding map

$$C : x = (x_1, \ldots, x_k) \mapsto c = (c_1, \ldots, c_n) = xG \bmod 2$$

with $G$ the $(k \times n)$ generator matrix and $x_i$, $c_i$, $G_{ij} \in \mathbb{Z}_2 = \{0, 1\}$.

The map defines for $\operatorname{rank}(G) = k$ the event set $E(C)$ of $2^k$ code words $c$. The rate of the code is given by
$R = k/n$ and it is for an error correcting code smaller than one.

The event set $E(C)$ is by linear algebra equivalently defined by an $((n-k) \times n)$ parity matrix $H$ with
$HG^T = 0 \bmod 2$ and thus

$$E(C) = \{c : Hc^T = 0 \bmod 2\}.$$

Note that the modulo operation is in the sequel not explicitly stated.

$E(C)$ is a subset of the set $\mathbb{S}$ of all $2^n$ binary vectors of length $n$. The restriction to a subset is interesting
as this leads to the possibility to correct corrupted words. However, the correction is a difficult operation
and can usually only be practically performed for simple or short codes.
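
The equivalence of the generator and the parity description can be checked by enumeration for a short code. The following is a minimal sketch only; the parity part `P` is a hypothetical choice of a systematic (7,4) code and the helper names `encode` and `in_code` are introduced purely for illustration.

```python
from itertools import product

# Hypothetical parity part P of a systematic (7,4) code; any binary P works here.
P = [[1, 1, 0], [1, 0, 1], [0, 1, 1], [1, 1, 1]]
k, n = 4, 7

# Generator matrix G = [I P] and parity matrix H = [P^T I], so that H G^T = 0 mod 2.
G = [[int(i == j) for j in range(k)] + P[i] for i in range(k)]
H = [[P[i][j] for i in range(k)] + [int(j == t) for t in range(n - k)]
     for j in range(n - k)]

def encode(x):
    """Encoding map c = x G mod 2."""
    return tuple(sum(x[i] * G[i][j] for i in range(k)) % 2 for j in range(n))

def in_code(c):
    """True if H c^T = 0 mod 2."""
    return all(sum(h[j] * c[j] for j in range(n)) % 2 == 0 for h in H)

# Event set E(C) obtained from the generator and from the parity description.
by_generator = {encode(x) for x in product((0, 1), repeat=k)}
by_parity = {c for c in product((0, 1), repeat=n) if in_code(c)}
```

Both sets contain the same $2^k$ words. The enumeration over all $2^n$ vectors is of course only feasible for such toy lengths.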

On the other hand long codes can be constructed by the use of such simple constituent codes. Such
constructions are reviewed in this section.

Definition 1 (Direct Coupling) The two constituent linear systematic coding maps

$$C^{(l)} : x \mapsto c^{(l)} = x \cdot G^{(l)} = x \cdot [I \; P^{(l)}] \quad \text{with } l = 1, 2$$

and a direct coupling give the overall code $E(C^{(a)})$ with $c^{(a)} = x \cdot [I \; P^{(1)} \; P^{(2)}]$.

Example 1 The constituent codes used for turbo decoding [2]
are two systematic convolutional codes [10] with low trellis decoding
complexity (see Appendix A.1). The overall code is obtained
by a direct coupling as depicted in the figure to the right.
The encoding of the non-systematic part $P^{(l)}$ can be done by a
recursive encoder. The $\Pi$ describes a permutation of the input
vector $x$, which significantly improves the overall code properties
but does not affect the complexity of the constituent decoders.
If the two codes have rate 1/2 then the overall code will
have rate 1/3.

Another possibility is to concatenate two constituent codes as defined below. 

Definition 2 (Concatenated Codes) By

$$c^{(1)} = xG^{(1)} \quad \text{and} \quad c^{(2)} = c^{(1)}G^{(2)} = xG^{(1)}G^{(2)}$$

(provided matching dimensions, i.e., a $(k \times n^{(1)})$ generator matrix $G^{(1)}$ and an $(n^{(1)} \times n)$ generator matrix
$G^{(2)}$) a concatenated code is given.
Remark 1 (Generalised Concatenation) A concatenation can be used to construct codes with defined
properties as usually a large minimum HAMMING distance. Note that generalised concatenated f^^\ codes 
exhibit the same basic concatenation map. Their distance properties are investigated under an additional
partitioning of code G^'^-' . 

Another possibility to couple codes is given in the following definition. This method will prove to be very
general, albeit rather non-intuitive as the description is based on parity check matrices $H$.

Definition 3 (Dual Coupling) The overall code

$$C^{(a)} : E(C^{(a)}) = \left\{ c : \begin{bmatrix} H^{(1)} \\ H^{(2)} \end{bmatrix} c^T = 0 \right\}$$

is obtained by a dual coupling of the constituent codes $C^{(l)} : E(C^{(l)}) = \{c : H^{(l)} c^T = 0\}$ for $l = 1, 2$.

By a dual coupling the code space is obtained as the intersection $C^{(a)} = C^{(1)} \cap C^{(2)}$ of the
constituent code spaces.
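
That stacking the parity matrices yields the intersection of the code spaces can be illustrated by a direct enumeration. This is a sketch under stated assumptions: the two parity matrices `H1` and `H2` are hypothetical toy examples of length 6 and `code_space` is a helper introduced only here.

```python
from itertools import product

def code_space(rows, n):
    """All binary words c of length n with H c^T = 0 mod 2 for the given rows of H."""
    return {c for c in product((0, 1), repeat=n)
            if all(sum(h[j] * c[j] for j in range(n)) % 2 == 0 for h in rows)}

n = 6
H1 = [[1, 1, 1, 0, 0, 0], [0, 0, 0, 1, 1, 1]]   # hypothetical constituent code 1
H2 = [[1, 0, 0, 1, 0, 0], [0, 1, 0, 0, 1, 0]]   # hypothetical constituent code 2

C1 = code_space(H1, n)
C2 = code_space(H2, n)
Ca = code_space(H1 + H2, n)      # dual coupling: parity matrices stacked row wise
```

Here `Ca` coincides with `C1 & C2`; with four linearly independent parity rows it contains $2^{6-4} = 4$ words.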

Example 2 A dually coupled code construction similar to turbo codes is to use two mutually permuted 
rate 2/3 convolutional codes. The intersection of these two codes gives a code with rate at least 1/3. To 
obtain a larger rate one may employ puncturing (not transmitting certain symbols). However, the encoding 
of the overall code is not as simple as for direct coupling codes. A straightforward way is to just use the 
generator matrix representation of the overall code. 

Remark 2 (LDPC Codes) LDPC codes are originally defined by a single parity check matrix with low
weight rows (and columns). An equivalent representation is via a graph of check nodes (one for each
row) and variable nodes (one for each column). This leads to a third equivalent representation with two
dually coupled constituent codes and a subsequent puncturing [12]. The first constituent code is thereby
given by a juxtaposition of repetition codes that represent the variable nodes (all node inputs need to
be equal). The second one is defined by single parity check codes representing the check nodes. The
puncturing at the end has to be done such that only one symbol per repetition code (code column) remains.

Theorem 1 Both direct coupling and concatenated codes are special cases of dual coupling codes.

Proof: The direct coupling code is equivalently described in the parity check form $H^{(a)} c^T = 0$ by the
parity check matrix

$$H^{(a)} = \begin{bmatrix} H^{(s1)} & H^{(r1)} & 0 \\ H^{(s2)} & 0 & H^{(r2)} \end{bmatrix} \quad \text{where } H^{(l)} = [H^{(sl)} \; H^{(rl)}] \text{ for } l = 1, 2$$

is the parity check matrix of $G^{(l)}$ consisting of systematic part $H^{(sl)}$ and redundant part $H^{(rl)}$. This is
obviously a dual coupling. For a concatenated code with systematic code $G^{(2)} = [I \; P^{(2)}]$ the equivalent
description by a parity check matrix is

$$H^{(a)} = \begin{bmatrix} H^{(1)} & 0 \\ H^{(s2)} & H^{(r2)} \end{bmatrix}$$

with $H^{(1)}$ and $H^{(2)} = [H^{(s2)} \; H^{(r2)}]$
the parity check matrix of $G^{(1)}$ respectively $G^{(2)}$. For non-systematic concatenated codes a virtual
systematic extension (punctured prior to the transmission) is needed [12]. Hence, a representation by a dual
coupling is again possible.

It is thus sufficient to consider only dual code couplings. The "dual" is therefore mostly omitted in the 
sequel. 

Remark 3 (Multiple Dual Codes) More than two codes can be dually coupled as described above: By

$$C^{(a)} = C^{(1)} \cap C^{(2)} \cap C^{(3)}$$

a coupling of three codes is given. The overall parity check matrix is there given by the juxtaposition of
the three constituent parity check matrices. Multiple dual couplings are produced by multiple intersections.
In the sequel mostly dual couplings with two constituent codes are considered.
1.1. Optimal Decoding

As stated above the main difficulty is not the encoding but the decoding of a corrupted word. This
corruption is usually the result of a transmission of the code word over a channel.

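Remark 4 (Channels) In the sequel we assume that the code symbols $C_i$ are in $\mathbb{B} = \{-1, +1\}$. This is
achieved by the use of the "BPSK"-map

$$B : x \mapsto y = \begin{cases} +1 & \text{for } x = 0 \\ -1 & \text{for } x = 1 \end{cases}$$

prior to the transmission. As channel we assume either a Binary Symmetric Channel (BSC) with channel
error probability $p$ and

$$P(r|s) = \prod_{i=1}^{n} (1-p)^{\langle r_i = s_i \rangle}\, p^{\langle r_i \neq s_i \rangle} \propto \prod_{i=1}^{n} \left(\frac{1-p}{p}\right)^{s_i r_i/2} = \prod_{i=1}^{n} \exp_2\!\left(s_i r_i \tfrac{1}{2} \log_2 \frac{1-p}{p}\right) = \exp_2\!\left(K \sum_{i=1}^{n} s_i r_i\right)$$

with $K = \frac{1}{2} \log_2 \frac{1-p}{p}$ and $s_i, r_i \in \mathbb{B} = \{-1, +1\}$, where

$$\langle b \rangle = \begin{cases} 0 & \text{if } b \text{ false} \\ 1 & \text{if } b \text{ true} \end{cases}$$
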

or a channel with Additive White GAUSS Noise (AWGN) given by

$$P(r|s) \propto \prod_{i=1}^{n} 2^{-(r_i - s_i)^2} \propto \exp_2\!\left(2 \sum_{i=1}^{n} r_i s_i\right) \quad \text{and } s_i \in \mathbb{B}$$

(this actually is the GAUSS probability density) and the by $2\sigma^2 = \log_2(e)$ normalised noise variance. The
received elements are in the AWGN case real valued, i.e., $r_i \in \mathbb{R}$.

Note that the normalised form $P(r|s) \propto \exp_2(\sum_{i} r_i s_i)$ is obtained by $r_i \leftarrow K r_i$ and an appropriate constant $K$.
Moreover, then both cases coincide.
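
The correlation form of the BSC likelihood can be verified numerically. The sketch below assumes $K = \frac{1}{2}\log_2\frac{1-p}{p}$ and uses a hypothetical received word `r` and error probability `p`; it checks that the ratio of the product form to $\exp_2(K \sum_i s_i r_i)$ does not depend on $s$.

```python
from itertools import product
from math import log2, isclose

p = 0.1                                   # hypothetical channel error probability
K = 0.5 * log2((1 - p) / p)
r = (+1, -1, -1, +1)                      # hypothetical received BSC word
n = len(r)

def bsc_likelihood(r, s):
    """Product form: (1-p)^(#agreements) * p^(#disagreements)."""
    agree = sum(ri == si for ri, si in zip(r, s))
    return (1 - p) ** agree * p ** (n - agree)

def correlation_form(r, s):
    """exp2(K * sum_i s_i r_i)."""
    return 2.0 ** (K * sum(ri * si for ri, si in zip(r, s)))

# The ratio is the same constant for every word s, i.e., the two forms are proportional.
ratios = [bsc_likelihood(r, s) / correlation_form(r, s)
          for s in product((-1, 1), repeat=n)]
```

The common constant is $(p(1-p))^{n/2}$, which drops out of all decoding decisions.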



Overall this gives that decoding is based on

1) the knowledge of the code space $E(C)$,

2) the knowledge of the channel map given by $P(r|c)$, and

3) the received information represented by $r$.

A decoding can be performed by a decision for some word $\hat{c}$, which is in the Maximum Likelihood (ML)
word decoding case

$$\hat{c} = \arg\max_{c \in E(C)} P(r|c)$$

or by decisions on the code symbols by ML symbol by symbol decoding

$$\hat{c}_i = \arg\max_{x \in \mathbb{B}} P^{(C)}_{C_i}(x|r) = \arg\max_{x \in \mathbb{B}} \sum_{c \in E(C),\, c_i = x} P(r|c).$$

Here $P^{(C)}_{C_i}(x|r)$ is the probability that $c_i = x$ under the knowledge of the code space $E(C)$. If no further
prior knowledge about the code map or other additional information is available then these decisions are
obviously optimal, i.e., the decisions exhibit smallest word respectively bit error probability.

Remark 5 (Dominating ML Word) If by $P^{(a)}(\hat{c}^{(a)}|r) \approx 1$ a dominating ML word decision exists then
necessarily the ML word decision and the ML symbol decisions coincide. The decoding problem is then
equivalent to solving either of the ML decisions.

ML word decoding is for the BSC equivalent to finding the code word with the smallest number of errors
$c_i \neq r_i$, respectively the smallest HAMMING distance $d_H(c, r)$. For the AWGN channel the word $c$ that
minimises EUCLID's quadratic distance $d_E^2(c, r) = \|r - c\|^2$ needs to be found.
For the independent channels of Remark 4 the ML decisions can be computed (see Appendix A.1) in the
code trellis by the VITERBI or the BCJR algorithm. However, due to the generally large trellis complexity
of the overall code these algorithms do there (practically) not apply.

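On the other hand one may compute the "uncoded" word probabilities

$$P(s|r) \propto P(r|s) \cdot \langle s \in \mathbb{S} \rangle, \qquad (1)$$

and for small constituent trellis complexities the constituent code word probabilities

$$P^{(l)}(s|r) := P_{C^{(l)}|R}(s|r) \propto P(r|s) \cdot \langle s \in C^{(l)} \rangle$$

for $l = 1, 2$ with $\mathbb{S} := E(S)$ the set of all words. This is interesting as the overall code word distribution

$$P^{(a)}(s|r) := P_{C^{(a)}|R}(s|r) \propto P(r|s) \cdot \langle s \in C^{(a)} \rangle$$

can be computed out of $P^{(l)}(s|r)$ and $P(s|r)$: It holds with Definition 3 that $C^{(a)} = C^{(1)} \cap C^{(2)}$ and thus

$$P^{(1)}(s|r) \cdot P^{(2)}(s|r) \propto (P(r|s))^2 \cdot \langle s \in C^{(1)} \rangle \cdot \langle s \in C^{(2)} \rangle = (P(r|s))^2 \cdot \langle s \in C^{(a)} \rangle,$$

which gives with (1) that

$$P^{(a)}(s|r) \propto \frac{P^{(1)}(s|r)\, P^{(2)}(s|r)}{P(s|r)}. \qquad (2)$$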

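If the constituent word probabilities are all known then optimal decoding decisions can be taken. I.e., one
can compute the ML word decision by

$$\hat{c}^{(a)} = \arg\max_{s \in \mathbb{S}} \frac{P^{(1)}(s|r)\, P^{(2)}(s|r)}{P(s|r)} \qquad (3)$$

or the ML symbol decisions by

$$\hat{c}_i^{(a)} = \arg\max_{x \in \mathbb{B}} P^{(a)}_{C_i}(x|r) = \arg\max_{x \in \mathbb{B}} \sum_{s \in \mathbb{S}_i(x)} P^{(a)}(s|r) = \arg\max_{x \in \mathbb{B}} \sum_{s \in \mathbb{S}_i(x)} \frac{P^{(1)}(s|r)\, P^{(2)}(s|r)}{P(s|r)} \qquad (4)$$

with $\mathbb{S}_i(x) := \{s \in \mathbb{S} : s_i = x\}$.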

Decoding decisions may therefore be taken by the constituent probabilities. However, one may by (2)
only compute a value proportional to each single word probability. The representation complexity of the
constituent word probability distribution remains prohibitively large. I.e., the decoding decisions by (3)
and (4) do not reduce the overall complexity as all word probabilities have to be jointly considered, which
is equivalent to investigating the complete code constraint.
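
For a toy length the combination of the constituent word probabilities can be carried out explicitly. The following sketch is an illustration only: the parity matrices `H1`, `H2` and the received values `r` are hypothetical, the channel is taken in the normalised form $P(r|s) \propto \exp_2(rs^T)$, and all $2^n$ words are enumerated, which is exactly the complexity the text describes as prohibitive.

```python
from itertools import product
from math import isclose

n = 6
H1 = [[1, 1, 1, 0, 0, 0], [0, 0, 0, 1, 1, 1]]   # hypothetical constituent code 1
H2 = [[1, 0, 0, 1, 0, 0], [0, 1, 0, 0, 1, 0]]   # hypothetical constituent code 2
r = (0.9, -0.2, 1.1, 0.4, -0.7, 0.3)            # hypothetical received values

def in_code(rows, bits):
    return all(sum(h[j] * bits[j] for j in range(n)) % 2 == 0 for h in rows)

def channel(bits):
    """P(r|s) up to a constant, with the BPSK map 0 -> +1, 1 -> -1."""
    s = [1 - 2 * b for b in bits]
    return 2.0 ** sum(ri * si for ri, si in zip(r, s))

def normalise(weights):
    total = sum(weights.values())
    return {s: w / total for s, w in weights.items()}

words = list(product((0, 1), repeat=n))
P = normalise({s: channel(s) for s in words})                             # uncoded
P1 = normalise({s: channel(s) * in_code(H1, s) for s in words})           # code 1
P2 = normalise({s: channel(s) * in_code(H2, s) for s in words})           # code 2
Pa = normalise({s: channel(s) * in_code(H1 + H2, s) for s in words})      # overall

# Overall distribution from the constituent ones, and the resulting ML word decision.
combined = normalise({s: P1[s] * P2[s] / P[s] for s in words})
ml_word = max(words, key=lambda s: combined[s])
```

The dictionary `combined` agrees with the directly computed overall distribution `Pa`, but it still holds one value per word.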



1.2. Belief Propagation 

The probabilities of the two constituent codes thus contain the complete knowledge about the decoding
problem. However, the constituent decoders may not use this knowledge (with reasonable complexity)
as then $2^n$ values need to be transferred. I.e., a realistic algorithm based on the constituent probabilities
should transfer only a small number of parameters.

In the (loopy) belief propagation algorithm this is done by transmitting only the constituently "believed"
symbol probabilities but repeating this several times. This algorithm is here shortly reviewed: One first uses a
transfer vector $w^{(1)}$ to represent the believed $P^{(1)}_{C_i}(x|r)$ of code 1. This belief representing transfer vector
is then used together with $r$ in the decoder of the other constituent code. I.e., a transfer vector $w^{(2)}$ is
computed out of $P^{(2)}_{C_i}(x|r, w^{(1)})$ that will then be reused for a new $w^{(1)}$ by $P^{(1)}_{C_i}(x|r, w^{(2)})$ and so forth.
The algorithm is stopped if the beliefs do not change any further and a decoding decision is emitted.

The beliefs $P^{(l)}_{C_i}(x|r, w^{(h)})$ for $l, h \in \{1, 2\}$ and $l \neq h$ are obtained by

$$P(r, w^{(h)}|s) = P(w^{(h)}|s)\, P(r|s),$$

which is a representation with independent $w^{(h)}$ and $r$. Moreover, it is assumed that $s_i \in \mathbb{B} = \{-1, +1\}$ and
that

$$P(w^{(h)}|s) = \prod_{i=1}^{n} P(w_i^{(h)}|s_i) \propto \exp_2\!\left(\sum_{i=1}^{n} w_i^{(h)} s_i\right) = \exp_2(w^{(h)} s^T) \qquad (5)$$

are of the form of $P(r|s)$ in Remark 4.

Remark 6 (Distributions and Trellis) Obviously many other choices for $P(w^{(h)}|s)$ exist. However, the
again independent description of the symbols $c_i = s_i$ in (5) leads to (see Appendix A.1) the possibility
to use trellis based computations, i.e., the symbol probabilities $P^{(l)}_{C_i}(x|r, w^{(h)})$ can be computed as before.

The transfer vector $w^{(l)}$ for belief propagation for given $r$ and $w^{(h)}$ with $l, h \in \{1, 2\}$ and $h \neq l$ is defined
by

$$P_{C_i}(x|r, w^{(1)}, w^{(2)}) = P^{(l)}_{C_i}(x|r, w^{(h)}) \quad \text{for all } i. \qquad (6)$$

I.e., the beliefs under $r$, $w^{(1)}$, $w^{(2)}$, and no further set restriction are set such that they are equal to the
beliefs under $w^{(h)}$, $r$, and the knowledge of the set restriction of the $l$-th constituent code. This is always
possible as shown below.

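Remark 7 (Notation) To simplify the notation we set in the sequel

$$m := (r, w^{(1)}, w^{(2)}) \quad \text{and} \quad m^{(l)} := (r, w^{(h)}) \text{ for } h \neq l,$$

and often $w^{(0)} := r$.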

For the uncoded beliefs $P_{C_i}(x|m)$ it is again assumed that the information and belief carrying $r$, $w^{(1)}$ and
$w^{(2)}$ are independent, i.e.,

$$P(m|c) = P(r, w^{(1)}, w^{(2)}|c) = P(r|c)\, P(w^{(1)}|c)\, P(w^{(2)}|c). \qquad (7)$$

The computation of $w^{(l)}$ for given $w^{(h)}$ is then simple as the independence assumptions (5) and (7) give
that

$$P_{C_i}(x|m) = P_{C_i}(x \mid r_i + w_i^{(1)} + w_i^{(2)}).$$

Moreover, the definition of the $w^{(l)}$ is simplified by the use of logarithmic probability ratios

$$L_i(m) = \frac{1}{2} \log_2 \frac{P_{C_i}(+1|m)}{P_{C_i}(-1|m)} \quad \text{and} \quad L_i^{(l)}(m^{(l)}) = \frac{1}{2} \log_2 \frac{P^{(l)}_{C_i}(+1|m^{(l)})}{P^{(l)}_{C_i}(-1|m^{(l)})}$$

for $l = 1, 2$. This representation is handy for the computations as it directly gives

$$L_i(m) = r_i + w_i^{(1)} + w_i^{(2)}$$

and thus that Equation (6) is equivalent to

$$w_i^{(l)} = L_i^{(l)}(m^{(l)}) - r_i - w_i^{(h)} \quad \text{for } l \neq h \text{ and all } i. \qquad (8)$$

This equation can be used as an iteration rule such that the uncoded beliefs are subsequently updated by
the constituent beliefs. The transfer vectors $w^{(1)}$ and $w^{(2)}$ are thereby via (8) iteratively updated. The
following definition further simplifies the notation.
Algorithm 1 Loopy Belief Propagation

1. Set $w^{(1)} = w^{(2)} = 0$, $l = 1$, and $h = 2$.

2. Swap $l$ and $h$.

3. Set $w^{(l)} = \breve{L}^{(l)}(m^{(l)})$.

4. If $w^{(h)} \neq \breve{L}^{(h)}(m^{(h)})$ then go to Step 2.

5. Set $\hat{c}_i = \operatorname{sign}(r_i + w_i^{(1)} + w_i^{(2)})$ for all $i$.



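Definition 4 (Extrinsic Symbol Probability) The extrinsic symbol probability of code $l$ is

$$\breve{P}^{(l)}_{C_i}(x|m^{(l)}) \propto P^{(l)}_{C_i}(x|m^{(l)}) \exp_2\!\left(-x (w_i^{(h)} + r_i)\right) \quad \text{for } h \neq l.$$

The extrinsic symbol probabilities are by (5) independent of $w_i^{(l)}$ for $l = 1, 2$ and $r_i$, i.e., they depend
only on the belief and information carrying $w_j^{(l)}$ and $r_j$ with $j \neq i$ from other or "extrinsic" symbol positions.
Moreover, one directly obtains the extrinsic logarithmic probability ratios

$$\breve{L}_i^{(l)}(m^{(l)}) := \frac{1}{2} \log_2 \frac{\breve{P}^{(l)}_{C_i}(+1|m^{(l)})}{\breve{P}^{(l)}_{C_i}(-1|m^{(l)})} = L_i^{(l)}(m^{(l)}) - r_i - w_i^{(h)} \quad \text{for } l \neq h. \qquad (9)$$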

With Equation (8) this gives the iteration rule

$$w^{(l)} = \breve{L}^{(l)}(m^{(l)})$$

and thus Algorithm 1. Note that one generally uses an alternative, less stringent stopping criterion in Step 4
of the algorithm.
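
The steps of the iteration may be sketched as follows. This is a toy illustration under stated assumptions and not an efficient decoder: the constituent symbol probabilities are obtained by enumerating all constituent code words instead of by a trellis computation, the two parity matrices in `H` and the channel values `r` are hypothetical, and Step 4 is replaced by a tolerance test. The two codes are chosen to share a single symbol only, so that the coupling is cycle free and the iteration returns the exact symbol decisions.

```python
from itertools import product
from math import log2

n = 6
# Hypothetical constituent parity matrices; they share only symbol 3,
# so the coupling is cycle free and the iteration is exact here.
H = {1: [[1, 1, 1, 0, 0, 0]], 2: [[0, 0, 1, 1, 1, 0]]}

def code_words(rows):
    """Words of the code {c : H c^T = 0} after the BPSK map 0 -> +1, 1 -> -1."""
    return [tuple(1 - 2 * b for b in bits) for bits in product((0, 1), repeat=n)
            if all(sum(h[j] * bits[j] for j in range(n)) % 2 == 0 for h in rows)]

C = {l: code_words(H[l]) for l in (1, 2)}

def log_ratios(code, v):
    """L_i = 1/2 log2 P(+1)/P(-1) for word weights exp2(v s^T), s in code."""
    weight = {s: 2.0 ** sum(a * b for a, b in zip(v, s)) for s in code}
    return [0.5 * log2(sum(p for s, p in weight.items() if s[i] == +1) /
                       sum(p for s, p in weight.items() if s[i] == -1))
            for i in range(n)]

def extrinsic(l, r, w_other):
    """Extrinsic ratios of code l: L^(l)_i(m^(l)) - r_i - w^(h)_i."""
    v = [ri + wi for ri, wi in zip(r, w_other)]
    return [L - vi for L, vi in zip(log_ratios(C[l], v), v)]

def belief_propagation(r, max_sweeps=50, tol=1e-9):
    w = {1: [0.0] * n, 2: [0.0] * n}                  # Step 1
    l, h = 1, 2
    for _ in range(max_sweeps):
        l, h = h, l                                   # Step 2
        w[l] = extrinsic(l, r, w[h])                  # Step 3
        check = extrinsic(h, r, w[l])                 # Step 4 (with tolerance)
        if max(abs(a - b) for a, b in zip(check, w[h])) < tol:
            break
    return [1 if ri + a + b >= 0 else -1              # Step 5
            for ri, a, b in zip(r, w[1], w[2])]

r = (0.8, -0.2, 0.5, 1.0, 0.6, -0.4)                  # hypothetical channel values
decision = belief_propagation(r)
```

In this example the negative channel value of the second symbol is overruled by the extrinsic belief of code 1. For couplings with cycles neither convergence nor optimality of the decision is guaranteed by this sketch.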

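If the algorithm converges then one obtains that

$$w^{(l)} = \breve{L}^{(l)}(m^{(l)}) \quad \text{for } l = 1, 2$$

and

$$\hat{c}_i = \operatorname{sign}\!\left(L_i(r) + \breve{L}_i^{(1)}(m^{(1)}) + \breve{L}_i^{(2)}(m^{(2)})\right) \qquad (10)$$

with $L_i(r) = r_i$. This is a rather intuitive form of the fixed point of iterative belief propagation. The
decoding decision $\hat{c}_i$ is defined by the sum of the (representations of the) channel information and the
extrinsic constituent code beliefs $\breve{L}_i^{(l)}(m^{(l)})$.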

Remark 8 (Performance) If the algorithm converges then simulations show that the decoding decision is
usually good. By density evolution [17] or extrinsic information transfer charts [19] the convergence of
iterative belief propagation is further investigated. These approaches evaluate which constituent codes are
suitable for iterative belief propagation. These approaches and simulations show that only rather weak codes
should be employed for good convergence properties. This indicates that the chosen transfer is often too
optimistic about its believed decisions.



1.3. Discrimination 

The belief propagation algorithm uses only knowledge about the constituent codes represented by $w^{(l)}$.
In this section we aim at increasing the transfer complexity by adding more variables and hope to obtain
thereby a better representation of the overall information and thus an improvement over the propagation of
only symbol beliefs.

Reconsider first the additional belief representation $w^{(l)}$ given by the distributions $P(s|w^{(1)})$ and $P(s|w^{(2)})$
used for belief propagation. The overall distributions are

$$\begin{aligned}
P(s|m) &= P(s|r, w^{(1)}, w^{(2)}) \propto P(r|s)\, P(w^{(1)}|s)\, P(w^{(2)}|s) \\
P^{(1)}(s|m^{(1)}) &= P^{(1)}(s|r, w^{(2)}) \propto P(r|s)\, P(w^{(2)}|s) \cdot \langle s \in C^{(1)} \rangle \qquad (11) \\
P^{(2)}(s|m^{(2)}) &= P^{(2)}(s|r, w^{(1)}) \propto P(r|s)\, P(w^{(1)}|s) \cdot \langle s \in C^{(2)} \rangle.
\end{aligned}$$



The following lemma first gives that these additional beliefs do not change the computation of the overall
word probabilities.

Lemma 1 It holds for all $w^{(1)}$, $w^{(2)}$ that

$$P^{(a)}(s|r) \propto \frac{P^{(1)}(s|m^{(1)})\, P^{(2)}(s|m^{(2)})}{P(s|m)}.$$

Proof: A direct computation of the equation with (11) gives as for (2) equality. The terms that depend on
$w^{(l)}$ vanish by the independence assumption (7).



To increase the transfer complexity now additional parameters are added to $s$. This first seems counter
intuitive as no new knowledge is added. However, with Lemma 1 the same holds true for the belief carrying
$w^{(l)}$ and optimal decoding.

Definition 5 (Word uncertainty) The uncertainty augmented word probability $P^{(l)}(s, u|m^{(l)})$ is

$$P^{(l)}(s, u|m^{(l)}) := P^{(l)}(s|m^{(l)}) \prod_{j=0}^{2} \langle u_j = w^{(j)} s^T \rangle$$

with $u = U(s) = (u_0, u_1, u_2)$.



This definition naturally extends to $P(s, u|m)$ and to $P^{(a)}(s, u|r)$.

Remark 9 (Notation) The notation of $P^{(a)}(s, u|r)$ does not reflect the dependency on $m$. The same holds
true for $P^{(l)}(s, u|m^{(l)})$ etc. A complete notation is for example $P^{(a)}(s, u|m \| r)$ or $P^{(l)}(s, u|m \| m^{(l)})$. To
maintain readability this dependency will not be explicitly stated in the sequel.

Under the assumption that code words with the same $u$ do not need to be distinguished one obtains the
following definition.

Definition 6 (Discriminated Distribution) The distribution of $u$ discriminated by $m$ is

$$P^{\otimes}(u|m) \propto \frac{P^{(1)}(u|m^{(1)})\, P^{(2)}(u|m^{(2)})}{P(u|m)}$$

with $\sum_{u \in \mathbb{U}} P^{\otimes}(u|m) = 1$,

$$P^{(l)}(u|m^{(l)}) = \sum_{s \in \mathbb{S}} P^{(l)}(s, u|m^{(l)}),$$

and $\mathbb{U} = E(U) = \{u : u_l = w^{(l)} s^T \; \forall l \text{ and } s \in \mathbb{S}\}$.

Remark 10 (Discrimination) Words $s$ with the same $u$ are not distinguished. As $m$ and $s$ define $u$ the
discrimination of words is steered by $m$. The variables $u_l$ are then used to relate to the distances $\|c - w^{(l)}\|^2$
(see Remark 4). Words that do not share the same distances are discriminated. The choice of $u$ and (5)
is natural as all code words with the same $u$ have the same probability, i.e., that

$$P^{(l)}(s, u|m^{(l)}) \propto \exp_2\!\left(\sum_{k=0, k \neq l}^{2} u_k\right) \cdot \prod_{j=0}^{2} \langle u_j = w^{(j)} s^T \rangle \cdot \langle s \in C^{(l)} \rangle \qquad (12)$$

and similar for $P(s, u|m)$ and $P^{(a)}(s, u|r)$. Generally it holds that $u$ is via

$$\sum_{k=0}^{2} u_k = K + \log_2 P(s|m) = K - H(s|m)$$

(with $K$ some constant) related to the uncertainty $H(s|m)$. Note that any map of $s$ on some $u$ will
define some discrimination. However, we will here only consider the correlation map, respectively the
discrimination of the information theoretic word uncertainties.

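In the same way one obtains the much more interesting (uncertainty) discriminated symbol probabilities.

Definition 7 (Discriminated Symbol Probabilities) The symbol probabilities discriminated by $m$ are

$$P^{\otimes}_{C_i}(x|m) \propto \sum_{u \in \mathbb{U}} P^{\otimes}_{C_i}(x, u|m) = \sum_{u \in \mathbb{U}} \frac{P^{(1)}_{C_i}(x, u|m^{(1)})\, P^{(2)}_{C_i}(x, u|m^{(2)})}{P_{C_i}(x, u|m)} \qquad (13)$$

with

$$P^{(l)}_{C_i}(x, u|m^{(l)}) = \sum_{s \in \mathbb{S}_i(x)} P^{(l)}(s, u|m^{(l)}).$$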

Remark 11 (Independence) Note that $P^{\otimes}_{C_i}(x|m)$ is independent of both $w_i^{(l)}$, $l = 1, 2$.

The discriminated symbol probabilities may be considered as commonly believed symbol probabilities
under discriminated word uncertainties.



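To obtain a first intuitive understanding of this fact we relate $P^{\otimes}_{C_i}(x|m)$ to the more accessible constituent
symbol probabilities $P^{(l)}_{C_i}(x|m^{(l)})$. It holds by Bayes' theorem that

$$P^{\otimes}_{C_i}(x|m) \propto \frac{P^{(1)}_{C_i}(x|m^{(1)})\, P^{(2)}_{C_i}(x|m^{(2)})}{P_{C_i}(x|m)} \cdot \bar{P}_{C_i}(x|m)$$

with (abusing notation as this is not a probability)

$$\bar{P}_{C_i}(x|m) \propto \sum_{u \in \mathbb{U}} \frac{P^{(1)}_{C_i}(u|x, m^{(1)})\, P^{(2)}_{C_i}(u|x, m^{(2)})}{P_{C_i}(u|x, m)}. \qquad (14)$$

In the logarithmic notation this gives

$$L^{\otimes}_i(m) = L_i^{(1)}(m^{(1)}) + L_i^{(2)}(m^{(2)}) - L_i(m) + \bar{L}_i(m)$$

or in the extrinsic notation of (9) that

$$L^{\otimes}_i(m) = \breve{L}_i^{(1)}(m^{(1)}) + \breve{L}_i^{(2)}(m^{(2)}) + L_i(r) + \bar{L}_i(m). \qquad (15)$$

Note first the similarity with (10). One has again a sum of the extrinsic beliefs, however, an additional
value $\bar{L}_i(m)$ is added, which is by Remark 11 necessarily independent of $w_i^{(l)}$ and $l = 1, 2$. Overall the
common belief joins the two constituent beliefs together with a "distance" correction term.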

Below we show that this new common belief is - under again practically prohibitively high complexity - 
just the real overall "belief", i.e., the correct symbol probabilities obtained by optimal symbol decoding. 

Definition 8 (Globally Maximal Discriminator) The discriminator $m$ is globally maximal (for $\mathbb{S}$) if
$|\mathbb{S}(u|m)| = 1$ for all $u \in \mathbb{U}$. I.e., for globally maximal discriminators there exists a one-to-one
correspondence between $s$ and $u$ and thus $|\mathbb{S}| = |\mathbb{U}|$.
Lemma 2 For a globally maximal discriminator $m$ it holds that

$$P^{\otimes}(u|m) = P^{(a)}(u|r) \quad \text{and} \quad P^{\otimes}_{C_i}(x|m) = P^{(a)}_{C_i}(x|r).$$

I.e., the by $m$ discriminated symbol probabilities are correct.

Proof: Lemma 1 and Definition 5 give

$$P^{(a)}(s, u|r) \propto \frac{P^{(1)}(s, u|m^{(1)})\, P^{(2)}(s, u|m^{(2)})}{P(s, u|m)}$$

as $u$ follows directly from $s$. For a globally maximal discriminator $m$ there exists a one-to-one correspondence
between $s$ and $u$. This gives that one can omit for any probability either $u$ or $s$. This proves the optimality
of the discriminated distribution.



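For the overall symbol probabilities holds

$$P^{(a)}_{C_i}(x|r) = \sum_{s \in \mathbb{S}_i(x)} P^{(a)}(s|r) \propto \sum_{s \in \mathbb{S}_i(x)} \frac{P^{(1)}(s, u|m^{(1)})\, P^{(2)}(s, u|m^{(2)})}{P(s, u|m)}.$$

With $P^{(l)}_{C_i}(x, s, u|m^{(l)}) = P^{(l)}(s, u|m^{(l)})$ for $s \in \mathbb{S}_i(x)$ and $P^{(l)}_{C_i}(x, s, u|m^{(l)}) = 0$ for $s \notin \mathbb{S}_i(x)$ the
right hand side becomes

$$\sum_{s \in \mathbb{S}_i(x)} \frac{P^{(1)}(s, u|m^{(1)})\, P^{(2)}(s, u|m^{(2)})}{P(s, u|m)} = \sum_{s \in \mathbb{S}} \frac{P^{(1)}_{C_i}(x, s, u|m^{(1)})\, P^{(2)}_{C_i}(x, s, u|m^{(2)})}{P_{C_i}(x, s, u|m)}.$$

By the one-to-one correspondence one can replace the sum over $s$ by a sum over $u$ to obtain

$$P^{(a)}_{C_i}(x|r) \propto \sum_{u \in \mathbb{U}} \frac{P^{(1)}_{C_i}(x, s, u|m^{(1)})\, P^{(2)}_{C_i}(x, s, u|m^{(2)})}{P_{C_i}(x, s, u|m)},$$

which is ($s$ can be omitted due to the one-to-one correspondence) the optimality of the discriminated
symbol probabilities.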

A globally maximal discriminator $m$ thus solves the problem of ML symbol by symbol decoding. Likewise
by

$$\arg\max_{u} P^{\otimes}(u|m) = \arg\max_{u} P^{(a)}(u|r) = \arg\max_{s} P^{(a)}(s|r)$$

the problem of ML word decoding is solved (provided the one-to-one correspondence of $u$ and $s$ can be
easily inverted).

This is not surprising as a globally maximal discriminator has by the one-to-one correspondence of $s$ and $u$
the discriminator complexity $|\mathbb{U}| = |\mathbb{S}|$. The transfer complexity is then just the complexity of the optimal
decoder based on constituent probabilities.
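
The statement of Lemma 2 can be checked by enumeration for a toy example. The sketch below is an illustration under stated assumptions: the parity matrices `H1`, `H2` and the received values `r` are hypothetical, the discriminator is chosen as $m = (r, w^{(1)}, 0)$ with $w_i^{(1)} = 2^i$ (which makes $u_1$ different for every word), and the discriminated symbol probabilities of (13) are computed by brute force over all words rather than in the constituent trellises.

```python
from itertools import product
from collections import defaultdict
from math import isclose

n = 6
H1 = [[1, 1, 1, 0, 0, 0], [0, 0, 0, 1, 1, 1]]   # hypothetical constituent code 1
H2 = [[1, 0, 0, 1, 0, 0], [0, 1, 0, 0, 1, 0]]   # hypothetical constituent code 2

def in_code(rows, s):
    bits = [(1 - si) // 2 for si in s]           # inverse BPSK map: +1 -> 0, -1 -> 1
    return all(sum(h[j] * bits[j] for j in range(n)) % 2 == 0 for h in rows)

def dot(a, s):
    return sum(ai * si for ai, si in zip(a, s))

def discriminated_symbol_probs(r, w1, w2, i):
    """Brute-force version of (13); unnormalised word weights suffice here."""
    P1, P2, P = defaultdict(float), defaultdict(float), defaultdict(float)
    for s in product((-1, 1), repeat=n):
        u = (dot(r, s), dot(w1, s), dot(w2, s))
        key = (s[i], u)
        P[key] += 2.0 ** (u[0] + u[1] + u[2])
        if in_code(H1, s):
            P1[key] += 2.0 ** (u[0] + u[2])      # m^(1) = (r, w^(2))
        if in_code(H2, s):
            P2[key] += 2.0 ** (u[0] + u[1])      # m^(2) = (r, w^(1))
    score = {+1: 0.0, -1: 0.0}
    for (x, u), p in P.items():
        score[x] += P1[(x, u)] * P2[(x, u)] / p
    total = score[+1] + score[-1]
    return {x: v / total for x, v in score.items()}

def overall_symbol_probs(r, i):
    """Symbol probabilities of the overall code by direct enumeration."""
    score = {+1: 0.0, -1: 0.0}
    for s in product((-1, 1), repeat=n):
        if in_code(H1, s) and in_code(H2, s):
            score[s[i]] += 2.0 ** dot(r, s)
    total = score[+1] + score[-1]
    return {x: v / total for x, v in score.items()}

r = (0.9, -0.2, 1.1, 0.4, -0.7, 0.3)             # hypothetical received values
w1 = [2 ** j for j in range(1, n + 1)]           # globally maximal choice
w2 = [0] * n
```

For every position `i`, `discriminated_symbol_probs(r, w1, w2, i)` then agrees with `overall_symbol_probs(r, i)`, at the price of one value of `u` per word.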

Remark 12 (Globally Maximal Discriminators) The vector $m = (r, w^{(1)}, 0)$ with $w_i^{(1)} = 2^i$ is an example
of a globally maximal discriminator as $u_1(s) = \sum_{i=1}^{n} s_i 2^i$ is different for all values of $s$. I.e., there
exists a one-to-one correspondence between $s$ and $u$. Generally it is rather simple to construct a globally
maximal discriminator. E.g., the $r$ received via an AWGN channel is usually already maximally discriminating:
The probability that two words $s^{(1)}, s^{(2)} \in \mathbb{S}$ share the same real valued distance to the received word
is generally zero.
2. Local Discriminators



In the last section the coupling of error correcting codes was reviewed and different decoders were
discussed. It was shown that an optimal decoding is due to the large representation complexity practically
not feasible, but that a transfer of beliefs may lead to a good decoding algorithm. A generalisation of this
approach led to the concept of discriminators and therewith to a new overall belief. The complexity of the
computation of this belief depends on $|\mathbb{U}|$, i.e., the number of different outcomes $u$ of the discrimination.
Finally it was shown that the obtained overall belief leads to the optimal overall decoding decision if
the set is with $|\mathbb{U}| = |\mathbb{S}|$ maximally large. (However, then the overall decoding complexity is not reduced.)

In this section we consider local discriminators with $|\mathbb{U}| \ll |\mathbb{S}|$. Then only a limited number of values
need to be transferred to compute by (13) a new overall belief $P^{\otimes}_{C_i}(x|m)$. These discriminated beliefs
$P^{\otimes}_{C_i}(x|m)$ may then be practically employed to improve iterative decoding. To do so we first show that
local discriminators exist.

Example 3 The $r$ obtained by a transmission over a BSC is generally a local discriminator. The map
$U(r) : s \mapsto u = (u_0, 0, 0)$ is then only dependent on the HAMMING distance $d_H(r, s)$, i.e.,

$$U_0(r) : s \mapsto u_0 = rs^T = n - 2 d_H(r, s)$$

and thus $\mathbb{U} = E(U_0) \subseteq \{-n, -n+2, \ldots, n-2, n\}$, which gives $|\mathbb{U}| \le n + 1$. This furthermore gives that
an additional "hard decision" choice of the $w^{(l)}$ will continue to yield a local "Hamming" discriminator
$m$.
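
The difference between a local and a globally maximal discriminator can be made concrete by counting the outcomes $u$. The following sketch uses a hypothetical hard-decision word `r` of length 8 and, for comparison, the weights $w_i = 2^i$ of Remark 12; the helper `correlation` is introduced here only for illustration.

```python
from itertools import product

n = 8
words = list(product((-1, 1), repeat=n))
r = (1, -1, -1, 1, 1, 1, -1, 1)              # hypothetical hard-decision BSC output

def correlation(w, s):
    """u = w s^T"""
    return sum(wi * si for wi, si in zip(w, s))

def hamming(a, b):
    return sum(ai != bi for ai, bi in zip(a, b))

# Local "Hamming" discriminator: u0 = r s^T takes at most n + 1 different values.
u0_values = {correlation(r, s) for s in words}

# Globally maximal discriminator: w_i = 2^i gives a different u1 for every word.
w = [2 ** i for i in range(1, n + 1)]
u1_values = {correlation(w, s) for s in words}
```

Here `u0_values` has $n + 1 = 9$ elements whereas `u1_values` has $2^n = 256$ elements, one per word.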

To investigate local discrimination now reconsider the discriminated distributions. With Remark 10 one
obtains the following lemma.

Lemma 3 The distributions of $u$ given $m$ are

$$P(u|m) \propto |\mathbb{S}(u|m)| \exp_2(u_0 + u_1 + u_2)$$

and

$$P^{(l)}(u|m^{(l)}) \propto |C^{(l)}(u|m)| \exp_2\!\left(\sum_{k=0, k \neq l}^{2} u_k\right)$$

where the sets $\mathbb{S}(u|m)$ and $C^{(l)}(u|m)$ are defined by $\mathbb{M}(u|m) := \{s \in \mathbb{M} : u_l = w^{(l)} s^T \; \forall l\}$.

Proof: By (12) follows that the probability of all words $s \in \mathbb{S}(u|m)$ with the same $u$ is equal and
proportional to $\exp_2(\sum_{k=0}^{2} u_k)$. As $|\mathbb{S}(u|m)|$ words are in $\mathbb{S}(u|m)$ this gives the first equation. The second
equation is obtained by adding the code constraint.

Remark 13 (Overall Distribution) In the same way follows (see Remark 9) that

$$P^{(a)}(u|r) \propto |C^{(a)}(u|m)| \exp_2(u_0). \qquad (16)$$

More general restrictions (see below) can always be handled by imposing restrictions on the considered
sets. One thus generally obtains for the distributions a description via set sizes that depend on $u$.

Example 4 With the concept of set sizes Example 3 is continued. Assume again that the discriminator is
given by $m = (r, 0, 0)$. In this case no discrimination takes place on $u_1$ and $u_2$ as one obtains $u_1 = u_2 = 0$
for all $s$. One first obtains the overall distribution $P^{(a)}(u|r)$ to be with Remark 13 the multiplication of
$\exp_2(u_0)$ with the distribution of the correlation $cr^T$ with $c \in C^{(a)}$ given by $|C^{(a)}(u|m)|$.
Assume furthermore that the overall maximum likelihood decision $\hat{c}^{(a)}$ is with

$$P^{(a)}(\hat{c}^{(a)}|r) \approx 1$$

distinguished. This assumption gives that

$$P^{(a)}(u|r) = P^{(a)}(u_0|r) \approx \begin{cases} 1 & \text{for } u_0 = u_0(\hat{c}^{(a)}) \\ 0 & \text{else.} \end{cases}$$

I.e., $P^{(a)}(u|r)$ consists of one peak.

Figure 1: Hard Decisions

For the other probabilities $P(u_0|m)$ and $P^{(l)}(u_0|m^{(l)})$ with Lemma 3 again a multiplication of correlation
distributions with $\exp_2(u_0)$ is obtained. These distributions will, however, due to the much larger spaces

$$|\mathbb{S}| \gg |C^{(l)}| \gg |C^{(a)}|$$

usually not be in the form of a single peak. Other words with $u_0 > \hat{c}^{(a)} r^T$ may appear. The same
then holds true for $P^{\otimes}(u_0|m)$. These considerations are exemplarily depicted in Figure 1. Note that the
distributions can all be computed (see Appendix A.1) in the constituent trellises.

For a local discrimination a computation in the constituent trellises produces by (13) symbol probabilities
$P^{\otimes}_{C_i}(x|m)$. In equivalence to (loopy) belief propagation these probabilities should lead to the definition of
some $w$ and thus to some iteration rule. Before considering this approach we evaluate the quality of the
discriminated symbol probabilities.

2.1. Typicality

With Lemma 3 one obtains that the discriminated symbol probabilities defined by (13) are

$$P^{\otimes}_{C_i}(x|m) \propto \sum_{u \in \mathbb{U}} \frac{|C_i^{(1)}(x, u|m)| \exp_2(u_0 + u_2)\, |C_i^{(2)}(x, u|m)| \exp_2(u_0 + u_1)}{|\mathbb{S}_i(x, u|m)| \exp_2(u_0 + u_1 + u_2)} \qquad (17)$$

with the sets $C_i^{(l)}(x, u|m)$ defined by $s \in C^{(l)}(u|m)$ and $s_i = x$. Hence, $P^{\otimes}_{C_i}(x|m)$ only depends on the
discriminated set sizes $|C_i^{(l)}(x, u|m)|$, $|\mathbb{S}_i(x, u|m)|$, and the word probabilities $P(s|r) \propto \exp_2(u_0(s))$.
The discriminated symbol probabilities should approximate the overall probabilities, i.e.,

$$\frac{|C_i^{(1)}(x, u|m)|\, |C_i^{(2)}(x, u|m)|}{|\mathbb{S}_i(x, u|m)|} \approx |C_i^{(a)}(x, u|m)|. \qquad (18)$$
Intuitively, the approximation thus uses the knowledge how many words of the same correlation values $u$
and decision $c_i = x$ are in both codes simultaneously. Moreover, depending on the discriminator $m$ the
quality of this approximation will change.

An average consideration of the approximations (18) is related to the following lemma.

Lemma 4 If the duals of the (linear) constituent codes do not share common words but the zero word then

$$|C^{(1)}|\,|C^{(2)}| = |\mathbb{S}|\,|C^{(a)}|. \qquad (19)$$

Proof: With Definition 3 and by assumption linearly independent $H^{(1)}$ and $H^{(2)}$ it holds that the dual
code dimension of the coupled code is just the sum of the dual code dimensions of the constituent codes,
i.e.,

$$n - k = (n - k^{(1)}) + (n - k^{(2)}).$$

This is equivalent to $k^{(1)} + k^{(2)} = n + k$ and thus the statement of the lemma.

This lemma extends to the constrained set sizes |cf as used in (fTSl l. The approximations are thus in 
the mean correct. 
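The set size identity (19) can be checked by enumeration on a small example. The following sketch assumes a hypothetical split of the parity checks of the (7,4) Hamming code into two constituent codes; the matrices `H1`, `H2` and the helper `code` are illustrative choices and not taken from the text.

```python
from itertools import product

def code(H, n):
    """Binary words of length n that satisfy all parity checks (rows) of H."""
    return {bits for bits in product((0, 1), repeat=n)
            if all(sum(h[i] * bits[i] for i in range(n)) % 2 == 0 for h in H)}

n = 7
# Assumed example: the three Hamming checks, split 2 + 1 (linearly independent rows).
H1 = [(1, 0, 1, 0, 1, 0, 1), (0, 1, 1, 0, 0, 1, 1)]
H2 = [(0, 0, 0, 1, 1, 1, 1)]

C1, C2 = code(H1, n), code(H2, n)
Ca = C1 & C2          # dually coupled code: all checks of H1 and H2 hold
size_S = 2 ** n       # |S|, the number of all binary words

print(len(C1) * len(C2), size_S * len(Ca))
```

Both printed products coincide because the rows of `H1` and `H2` are linearly independent, which is the assumption of Lemma 4.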

For random coding and independently chosen $m = (r, w^{(1)}, w^{(2)})$ this consideration can be put into a more precise form.

Lemma 5 For random (long) codes $C^{(1)}$ and $C^{(2)}$ and independently chosen $m$ the asymptotic equality
\[ |C_i^{(a)}(x,u|m)| \asymp \frac{|C_i^{(1)}(x,u|m)|\, |C_i^{(2)}(x,u|m)|}{|\mathbb{S}_i(x,u|m)|} \tag{20} \]
holds.

Proof: The probability that a random choice in $\mathbb{S}$ lies in $\mathbb{S}(u|m)$ is just the ratio of the set sizes $|\mathbb{S}(u|m)|$ and $|\mathbb{S}|$. For a random coupled code $C^{(a)}$ the codewords are a random subset of the set $\mathbb{S}$. For $|C^{(a)}| \gg 1$ the law of large numbers thus gives the asymptotic equality
\[ \frac{|C_i^{(a)}(x,u|m)|}{|C^{(a)}|} \asymp \frac{|\mathbb{S}_i(x,u|m)|}{|\mathbb{S}|}. \tag{21} \]
The same holds true for the constituent codes:
\[ \frac{|C_i^{(l)}(x,u|m)|}{|C^{(l)}|} \asymp \frac{|\mathbb{S}_i(x,u|m)|}{|\mathbb{S}|}. \]
A multiplication of the equality of code 1 with the one of code 2 gives the asymptotic equivalence
\[ \frac{|\mathbb{S}|}{|C^{(1)}|\,|C^{(2)}|} \cdot \frac{|C_i^{(1)}(x,u|m)|\,|C_i^{(2)}(x,u|m)|}{|\mathbb{S}_i(x,u|m)|} \asymp \frac{|\mathbb{S}_i(x,u|m)|}{|\mathbb{S}|}. \tag{22} \]
Combining (21) and (22) then leads to
\[ \frac{|\mathbb{S}|}{|C^{(1)}|\,|C^{(2)}|} \cdot \frac{|C_i^{(1)}(x,u|m)|\,|C_i^{(2)}(x,u|m)|}{|\mathbb{S}_i(x,u|m)|} \asymp \frac{|C_i^{(a)}(x,u|m)|}{|C^{(a)}|}. \]
With (19) this is the statement of the lemma.

Remark 14 (Randomness) The proof of the lemma indicates that the approximation is rather good for code choices that are independent of $m$, i.e., perfect randomness of the codes is generally not needed. This can be understood by the concept of random codes in information theory: a random code is generally a good code. Conversely, a good code should not exhibit any structure, i.e., it behaves like a random code.

2.2. Distinguished Words

The received vector $r$ is obtained from the channel and the encoding. As it depends on $r$, the discriminator $m$ is thus generally not independent of the encoding. This becomes directly clear by reconsidering Example 4 and the assumption that a distinguished word $c^{(a)}$ with
\[ P^{(a)}(c^{(a)}|r) \approx 1 \]
exists. In this case the constituent distributions and thus likewise the discriminator distribution $P^{\otimes}(u|r)$ will be large in a region where a "typical" number of errors $t$ occurred, i.e., $u_0 = r c^T \approx n - 2t$.

For an independent $m$, however, this would not be the case: Then $P^{\otimes}(u|m)$ would with Lemma 5 be large in the vicinity of a typical minimal overall code word distance. This distance is generally larger than the expected number of errors $t$ under a distinguished word. Hence, $P^{\otimes}(u|m)$ would then be large at a smaller $u_0$ than under a dependent $m$.

Remark 15 (Channel Capacity and Typical Sets) The existence of a distinguished word is equivalent to assuming a long random code of rate below capacity [18]. The word sent is then the only one in the typical set, i.e., it has a small distance to $r$. The other words of a random code will typically exhibit a large distance to $r$.



To describe single words one needs to describe how well certain environments in $u$ given $m$ are discriminated. The precision of the approximation of $|C_i^{(a)}(x,u|m)|$ by (18) obviously depends on the set size $|\mathbb{S}_i(x,u|m)|$. This leads to the following definition.

Definition 9 (Maximally Discriminated Region) The region maximally discriminated by $m$,
\[ \mathbb{D}(m) := \bigcup_{u \,:\, |\mathbb{S}(u|m)| = 1} \mathbb{S}(u|m), \]
consists of all words $s$ that are uniquely determined by their correlation values $u$ with $u_l = s\, w^{(l)T}$ for $l = 0, 1, 2$.

Theorem 2 For independent constituent codes and a distinguished event that is maximally discriminated, $c^{(a)} \in \mathbb{D}(m)$, it holds that
\[ P^{\otimes}_{C_i}(x|m) \asymp P^{(a)}_{C_i}(x|r) \quad\text{and}\quad P^{\otimes}(u|m) \asymp P^{(a)}(u|r). \]

Proof: It holds with (17) that
\[ P^{\otimes}_{C_i}(x|m) \propto \sum_{u \in \mathbb{U}} \frac{|C_i^{(1)}(x,u|m)|\, |C_i^{(2)}(x,u|m)|}{|\mathbb{S}_i(x,u|m)|} \exp_2(u_0). \tag{23} \]
For the distinguished event $c^{(a)} \in C^{(a)}$ and $x = c_i^{(a)}$ it follows that
\[ \frac{|C_i^{(1)}(x,u(c^{(a)})|m)|\, |C_i^{(2)}(x,u(c^{(a)})|m)|}{|\mathbb{S}_i(x,u(c^{(a)})|m)|} = 1 \]
as by assumption $c^{(a)} \in \mathbb{D}(m)$ is maximally discriminated, which gives by definition and $c^{(a)} \in C^{(l)}$ for $l = 1, 2$ that
\[ |\mathbb{S}_i(x,u(c^{(a)})|m)| = |C_i^{(1)}(x,u(c^{(a)})|m)| = |C_i^{(2)}(x,u(c^{(a)})|m)| = 1. \]
I.e., the term with $u = u(c^{(a)})$ in (23) is correctly estimated.

The other terms in (23) represent non distinguished words and can (with the assumption of independent constituent codes) be considered to be independent of $m$. They can therefore be assumed to be obtained by random coding. I.e., for
\[ u \ne u(c^{(a)}) \quad\text{with}\quad u_l(c^{(a)}) = w^{(l)} c^{(a)T} \]
the asymptotic equality
\[ \frac{|C_i^{(1)}(x,u|m)|\, |C_i^{(2)}(x,u|m)|}{|\mathbb{S}_i(x,u|m)|} \asymp |C_i^{(a)}(x,u|m)| \]
of Lemma 5 holds. Hence the other words are (asymptotically) correctly estimated, too.

Moreover, with (23) one obtains for $c^{(a)}$ a probability value proportional to $\exp_2(r c^{(a)T})$. The other terms of (23) are much smaller: An independent random code typically does not exhibit code words of small distance to $r$. As the code rate is below capacity, $P^{\otimes}(u(c^{(a)})|m)$ then exceeds the sum of the probabilities of the other words. Asymptotically, by (17) both the overall symbol probabilities and the overall distribution of correlations follow.

Remark 16 (Distance) Note that the multiplication with $\exp_2(u_0)$ in (23) excludes elements that are not in the distinguished set (i.e., elements with large distance to $r$). These words can, as shown by information theory, not dominate (a random code) in probability. I.e., a maximal discrimination of non typical words will not significantly change the discriminated symbol probabilities $P^{\otimes}_{C_i}(x|m)$. This indicates that a random choice of the $w^{(l)}$ for $l = 1, 2$ will typically lead to similar beliefs $P^{\otimes}_{C_i}(x|m)$ as under $w^{(1)} = w^{(2)} = 0$. Conversely, if one code word at a small distance is maximally discriminated, then its probability typically dominates the probabilities of the other terms in (23).

Example 5 We continue the example above. The discriminator $m = (r, c^{(a)}, 0)$ maximally discriminates the distinguished word $c^{(a)}$ at
\[ u = u(c^{(a)}) = (n - 2 d_H(r, c^{(a)}),\; n,\; 0). \]
The discriminator complexity $|\mathbb{U}|$ is at most $(n+1)^2$ as only this many different values of
\[ u = (n - 2 d_H(r, c),\; n - 2 d_H(c^{(a)}, c),\; 0) \]
exist. The complexity is then given by the computation of at most $(n+1)^2$ elements. As this has to be done $n$ times in the trellis (see Appendix A.1) the asymptotic complexity becomes $O(n^3)$ (for fixed trellis state complexity). The computation will give by Theorem 2 that
\[ P^{\otimes}_{C_i}(x|m) \asymp P^{(a)}_{C_i}(x|r) \quad\text{with}\quad c_i^{(a)} = \operatorname{sign}(L_i^{\otimes}(m)) \]
as $c^{(a)}$ is distinguished and as all other words can be assumed to be chosen independently. I.e., $P^{\otimes}(u|m)$ exhibits a peak of height 1 and the $P^{\otimes}_{C_i}(x|m)$ give the asymptotically correct symbol probabilities.
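The counting argument of Example 5 can be made concrete by brute force. The sketch below enumerates all words of a short block, groups them by $u = (r s^T, c^{(a)} s^T, 0)$, and inspects the cell of the discriminating word. The vectors `c_a` and `r` are arbitrary stand-ins chosen for illustration; `c_a` is not claimed to be a codeword of any particular code.

```python
from itertools import product

def corr(a, b):
    """Correlation a b^T of two vectors."""
    return sum(x * y for x, y in zip(a, b))

n = 7
c_a = (1, 1, -1, 1, -1, -1, 1)   # stand-in for the distinguished word
r = (1, 1, -1, -1, -1, -1, 1)    # c_a with one flipped symbol (BSC output)

# Sizes |S(u|m)| for the discriminator m = (r, c_a, 0).
cells = {}
for s in product((1, -1), repeat=n):
    u = (corr(r, s), corr(c_a, s), 0)
    cells[u] = cells.get(u, 0) + 1

u_ca = (corr(r, c_a), corr(c_a, c_a), 0)
print(len(cells), (n + 1) ** 2, u_ca, cells[u_ca])
```

The cell of `c_a` contains a single word because $u_1 = n$ is reached by `c_a` only, i.e., the word is maximally discriminated while the number of distinct $u$ stays below $(n+1)^2$.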



2.3. Well Defined Discriminators 

Example 5 shows that for the distinguished event $c^{(a)}$ the hard decision discriminator
\[ m = (r, c^{(a)}, 0) \quad\text{with}\quad c_i^{(a)} = \operatorname{sign}(L_i^{\otimes}(m)) \]
produces discriminated symbol beliefs close to the overall symbol probabilities. The discriminator complexity $|\mathbb{U}| \le (n+1)^2$ is thus sufficient to obtain the asymptotically correct decoding decision.

Remark 17 (Equivalent Hard Decision Discriminators) By (23) the hard decision discriminators
\[ m = (r, w, 0), \quad m = (r, 0, w), \quad\text{and}\quad m = (r, w, w) \]
are equivalent: For the three cases the same $|C_i^{(l)}(x,u|m)|$ and $|\mathbb{S}_i(x,u|m)|$ and thus the same $P^{\otimes}_{C_i}(x|m)$ follow. In the sequel of this section we will (for symmetry reasons) only consider the discriminators $m = (r, w, w)$.

The discussion above shows that a discriminator with randomly chosen $w$ should give almost the same $L_i^{\otimes}(m)$ as $L_i^{\otimes}(r, 0, 0)$. If, however, the discriminator is strongly dependent on the distinguished solution, i.e., $w = c^{(a)}$, then the correct solution is found via $L_i^{\otimes}(m)$. This gives the following definition and lemma.

Definition 10 (Well Defined Discriminator) A well defined discriminator $m = (r, w, w)$ fulfils
\[ w_i = \operatorname{sign}(L_i^{\otimes}(m)) \quad\text{for all } i. \tag{24} \]

Lemma 6 For a BSC and distinguished $c^{(a)}$ there exists a well defined discriminator $m = (r, w, w)$ with $w_i, r_i \in \mathbb{B}$ such that $c_i^{(a)} = w_i$.

Proof: Set $m = (r, c^{(a)}, c^{(a)})$. For this choice $c^{(a)} \in \mathbb{D}(m)$ holds and thus with Theorem 2 asymptotic equality. Moreover, for a distinguished element it holds that
\[ P^{(a)}_{C_i}(c_i^{(a)}|r) \approx 1 \]
and thus $c_i^{(a)} = \operatorname{sign}(L_i^{(a)}(r)) = \operatorname{sign}(L_i^{\otimes}(m))$.

The definition of a well defined discriminator (24) can be used as an iteration rule, which gives Algorithm 2. By Lemma 6 the iteration exhibits a fixed point, which provably represents the distinguished solution. Note that the employment of $w^{(1)} = w^{(2)} = w$ is handy here as by $L_i^{\otimes}(m)$ only one common belief is available. This is in contrast to Algorithm 1, where the employment of the two constituent beliefs generally gives $w^{(1)} \ne w^{(2)}$.

Algorithm 2 Iterative Hard Decision Discrimination

1. Set $m = (r, 0, 0)$ and $w = 0$.

2. Set $v = w$ and $w_i = \operatorname{sign}(L_i^{\otimes}(m))$ for all $i$.

3. If $v \ne w$ then $m = (r, w, w)$ and go to 2.

4. Set $\hat{c} = w$.
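For very short codes the iteration can be tried out without any trellis by computing the discriminated beliefs through enumeration of all words. The following is only a brute-force sketch under several assumptions: the constituent codes are given by hypothetical parity-check rows, the beliefs use the weights $|C_i^{(1)}|\,|C_i^{(2)}| / |\mathbb{S}_i| \cdot 2^{u_0}$ per correlation value, and a tie in the decision is resolved towards $+1$. All function names are illustrative.

```python
from itertools import product
from collections import Counter

def codewords(H, n):
    """Words in {+1,-1}^n (bit 0 -> +1, bit 1 -> -1) satisfying all checks in H."""
    return [tuple(1 - 2 * b for b in bits)
            for bits in product((0, 1), repeat=n)
            if all(sum(h[i] * bits[i] for i in range(n)) % 2 == 0 for h in H)]

def corr(a, b):
    return sum(x * y for x, y in zip(a, b))

def hard_decisions(r, w, S, C1, C2):
    """sign of the discriminated symbol beliefs for m = (r, w, w), by enumeration."""
    n = len(r)

    def set_sizes(words):
        # sizes of {s : s_i = x, u(s) = u}; u_1 = u_2 since both transfer vectors are w
        sizes = Counter()
        for s in words:
            u = (corr(r, s), corr(w, s))
            for i in range(n):
                sizes[(i, s[i], u)] += 1
        return sizes

    nS, n1, n2 = set_sizes(S), set_sizes(C1), set_sizes(C2)
    belief = Counter()
    for (i, x, u), size in nS.items():
        belief[(i, x)] += n1[(i, x, u)] * n2[(i, x, u)] / size * 2.0 ** u[0]
    # ties go to +1 (an arbitrary choice of this sketch)
    return tuple(1 if belief[(i, 1)] >= belief[(i, -1)] else -1 for i in range(n))

def iterate(r, S, C1, C2, max_iter=50):
    w = (0,) * len(r)                                   # Step 1
    for _ in range(max_iter):
        v, w = w, hard_decisions(r, w, S, C1, C2)       # Step 2
        if v == w:                                      # Step 3: fixed point reached
            break
    return w                                            # Step 4

n = 7
S = list(product((1, -1), repeat=n))
C1 = codewords([(1, 0, 1, 0, 1, 0, 1), (0, 1, 1, 0, 0, 1, 1)], n)  # assumed split
C2 = codewords([(0, 0, 0, 1, 1, 1, 1)], n)
print(iterate((1, 1, 1, 1, 1, 1, 1), S, C1, C2))
```

The iteration cap `max_iter` is an addition of the sketch and not part of Algorithm 2. With an error-free received word the codeword itself is a fixed point of the iteration; for noisy inputs and longer codes the enumeration would have to be replaced by the trellis computations of Appendix A.1.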



To understand the overall properties of the algorithm one needs to consider its convergence properties and the existence of other fixed points. A first intuitive assessment of the algorithm is as follows. The decisions taken by $w_i = \operatorname{sign}(L_i^{\otimes}(r, 0, 0))$ should by (18) lead to a smaller symbol error probability than the one over $r$. Overall these decisions are based on $P^{\otimes}(u|r, 0, 0)$. This distribution is necessarily large in the vicinity of $u_0 = n - 2t$ with $t$ the expected number of errors.

The subsequent discrimination with $w$ and $r$ will consider the vicinity of $c^{(a)}$ more precisely if $w c^T$ is larger than $r c^T$: In this vicinity fewer words exist, which gives that the $|\mathbb{S}(u|m)|$ are smaller there. A smaller error probability in $w$ is thus with (17) typically equivalent to a better discrimination in the vicinity of $c^{(a)}$. This indicates that the discriminator $(r, w, w)$ is better than $(r, 0, 0)$. Hence, the new $w_i = \operatorname{sign}(L_i^{\otimes}(r, w, w))$ should again exhibit a smaller error probability, and so forth. If the iteration ends then a stable solution is found. Finally, the solution $w = c^{(a)}$ is stable. This behaviour is depicted by way of example in Figure 2, where the density of the squares represents the probability $P^{\otimes}(u|m)$ of $u = (u_0, u_1)$.

Figure 2: Hard Decision Discrimination. Panels: (a) Initialisation, (b) Intermediate Step, (c) Stable; each panel shows $P^{\otimes}(u|m)$ with $n - 2t$ marked on the $u_0$ axis.

2.4. Cross Entropy

To obtain a quantitative assessment of Algorithm 2 we use the following definition.

Definition 11 (Cross Entropy) The cross entropy
\[ H(C|w\|r) := E_C[H(s|w)\,|\,r] = -\sum_{s \in C} P_C(s|r) \log_2 P(s|w) \]
is the expectation of the uncertainty $H(s|w) = -\log_2 P(s|w)$ under $r$ and $c \in \mathbb{E}(C)$.

The cross entropy measures, as does the Kullback-Leibler distance
\[ D(C|w\|r) := H(C|w\|r) - H(C|r) \]
with
\[ H(C|r) := E_C[H_C(s|r)\,|\,r] = -\sum_{s \in C} P_C(s|r) \log_2 P_C(s|r), \]
the similarity between the distributions $P(c|r)$ and $P(s|w)$. By Jensen's inequality it is easy to show [18] that $D(C|w\|r) \ge 0$ and thus
\[ H(C|w\|r) \ge H(C|r) \ge 0. \]
The entropy $H(C|r)$ is an information theoretic measure of the number of probable words in $\mathbb{E}(C)$ under $r$. To better explain the cross entropy $H(C|w\|r)$ we briefly review some results regarding the entropy.

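The typical set $\mathbb{A}_{n\varepsilon}(C|r)$ is given by the typical region
\[ \mathbb{A}_{n\varepsilon}(C|r) = \{ c \in \mathbb{E}(C) : |H(c|r) - H(C|r)| \le n\varepsilon \} \]
of word uncertainties
\[ H(c|r) = -\log_2 P(c|r). \]
This definition directly gives
\[ 1 \ge \sum_{\mathbb{A}_{n\varepsilon}(C|r)} \exp_2(-H(c|r)) \ge \exp_2(-H(C|r) - n\varepsilon) \sum_{\mathbb{A}_{n\varepsilon}(C|r)} 1, \]
respectively,
\[ P_C(\mathbb{A}_{n\varepsilon}(C|r)|r) = \sum_{\mathbb{A}_{n\varepsilon}(C|r)} P(c|r) = \sum_{\mathbb{A}_{n\varepsilon}(C|r)} \exp_2(-H(c|r)) \le \exp_2(-H(C|r) + n\varepsilon) \sum_{\mathbb{A}_{n\varepsilon}(C|r)} 1. \]
With
\[ \sum_{\mathbb{A}_{n\varepsilon}(C|r)} 1 = |\mathbb{A}_{n\varepsilon}(C|r)| \]
this leads to the bounds on the logarithmic set sizes
\[ H(C|r) + n\varepsilon \ge \log_2 |\mathbb{A}_{n\varepsilon}(C|r)| \ge H(C|r) + \log_2(P_C(\mathbb{A}_{n\varepsilon}(C|r)|r)) - n\varepsilon \]
by the entropy. For many independent events in $r$ the law of large numbers gives for $\varepsilon > 0$ that $P_C(\mathbb{A}_{n\varepsilon}(C|r)|r) \approx 1$ and thus $H(C|r) \approx \log_2(|\mathbb{A}_{n\varepsilon}(C|r)|)$.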

We investigate whether a similar statement can be made for the cross entropy. To do so, first a cross typical set $\mathbb{A}_{n\varepsilon}(C|w\|r)$ is defined by the region of typical word uncertainties:
\[ \min(H(\mathbb{S}|w), H(C|w\|r)) - n\varepsilon \le H(s|w) \le \max(H(\mathbb{S}|w), H(C|w\|r)) + n\varepsilon. \tag{25} \]
I.e., the region spans the typical set in $w$ but includes more words if $H(\mathbb{S}|w) \ne H(C|w\|r)$. As the typical set in $w$ is included this gives for large $n$ that
\[ P_{\mathbb{S}}(\mathbb{A}_{n\varepsilon}(C|w\|r)|w) \approx 1 \]
and then in the same way as above the bounds on the logarithmic set size
\[ \max(H(\mathbb{S}|w), H(C|w\|r)) + n\varepsilon \ge \log_2 |\mathbb{A}_{n\varepsilon}(C|w\|r)| \ge \min(H(\mathbb{S}|w), H(C|w\|r)) - n\varepsilon. \]
Moreover, by the definition of the cross entropy and the law of large numbers, typically
\[ P_C(\mathbb{A}_{n\varepsilon}(C|w\|r)|r) \approx 1 \]
is true, too. This gives that the cross typical set includes the typical sets $\mathbb{A}_{n\varepsilon}(\mathbb{S}|w)$ and $\mathbb{A}_{n\varepsilon}(C|r)$, i.e.,
\[ \mathbb{A}_{n\varepsilon}(C|w\|r) \supseteq \mathbb{A}_{n\varepsilon}(\mathbb{S}|w) \quad\text{and}\quad \mathbb{A}_{n\varepsilon}(C|w\|r) \supseteq \mathbb{A}_{n\varepsilon}(C|r). \tag{26} \]
If one wants to define a transfer vector $w$ based on $r$ one is thus interested in obtaining a representation in $w$ such that the logarithmic set size
\[ \log_2 |\mathbb{A}_{n\varepsilon}(C|w\|r)| \le \max(H(\mathbb{S}|w), H(C|w\|r)) + n\varepsilon \]
is as small as possible.

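In the sequel we consider $P(s|w) \propto P(w|s)$ with independent symbols. This probability is given by
\[ P(s|w) = \prod_{i=1}^{n} \frac{2^{s_i w_i}}{2^{w_i} + 2^{-w_i}}. \tag{27} \]
The cross entropy thus becomes
\[ H(C|w\|r) = \sum_{s \in C} P_C(s|r) \sum_{i=1}^{n} \left( \log_2(2^{w_i} + 2^{-w_i}) - s_i w_i \right) = \sum_{i=1}^{n} \log_2(2^{w_i} + 2^{-w_i}) - \sum_{i=1}^{n} \sum_{s \in C} P_C(s|r)\, s_i w_i \tag{28} \]
and
\[ \sum_{i=1}^{n} \sum_{s \in C} P_C(s|r)\, s_i w_i = \sum_{i=1}^{n} E_C[c_i|r]\, w_i = \sum_{i=1}^{n} w_i \left( P^{(c)}_{C_i}(+1|r) - P^{(c)}_{C_i}(-1|r) \right). \]
This definition almost directly defines an optimal transfer.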

Lemma 7 Equal logarithmic symbol probability ratios
\[ w_i = L_i(w) = L_i^{(c)}(r) = \frac{1}{2} \log_2 \frac{P^{(c)}_{C_i}(+1|r)}{P^{(c)}_{C_i}(-1|r)} \]
and $P(s|w) \propto P(w|s)$ minimise the cross entropy $H(C|w\|r)$ and the Kullback-Leibler distance $D(C|w\|r)$.

Proof: First it holds by (27) that
\[ w_i = L_i(w) = \frac{1}{2} \log_2 \frac{\exp_2(+w_i)}{\exp_2(-w_i)}. \]
A differentiation of (28) leads to
\[ \frac{\partial}{\partial w_i} H(C|w\|r) = \sum_{s \in C} P_C(s|r) (\tanh_2(w_i) - s_i) = \tanh_2(w_i) - \sum_{s \in C} s_i P_C(s|r) = 0 \]
with $\tanh_2(x) = (2^x - 2^{-x})/(2^x + 2^{-x})$. This directly gives that
\[ \tanh_2(w_i) = \sum_{s \in C} s_i P_C(s|r). \]
As
\[ \tanh_2(L_i^{(c)}(r)) = P^{(c)}_{C_i}(+1|r) - P^{(c)}_{C_i}(-1|r) \]
and $\frac{\partial}{\partial w_i} H(C|r) = 0$ this is equivalent to the statement of the lemma.

I.e., the definition of $w$ by $L_i(w) = L_i^{(c)}(r)$ is a consequence of the independence assumption. Especially interesting is that $L_i(w) = L_i^{(c)}(r)$ directly implies that
\[ H(\mathbb{S}|w) = H(C|w\|r), \]
which gives with (26) that
\[ \mathbb{A}_{n\varepsilon}(C|w\|r) = \mathbb{A}_{n\varepsilon}(\mathbb{S}|w) \quad\text{and thus}\quad \mathbb{A}_{n\varepsilon}(\mathbb{S}|w) \supseteq \mathbb{A}_{n\varepsilon}(C|r). \]
A belief representing transfer vector $w$ thus typically describes all probable codewords.
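The statement of Lemma 7 can be checked numerically for a single symbol. The sketch below evaluates one summand of the cross entropy (28) for an assumed symbol probability $P(+1|r) = 0.9$; the function names and the chosen value are illustrative only.

```python
import math

def tanh2(x):
    """tanh to the base 2: (2^x - 2^-x) / (2^x + 2^-x)."""
    return (2.0 ** x - 2.0 ** -x) / (2.0 ** x + 2.0 ** -x)

def symbol_cross_entropy(w, p_plus):
    """One summand of (28): log2(2^w + 2^-w) - w * (P(+1|r) - P(-1|r))."""
    return math.log2(2.0 ** w + 2.0 ** -w) - w * (2.0 * p_plus - 1.0)

def llr(p_plus):
    """Logarithmic symbol probability ratio 1/2 * log2(P(+1) / P(-1))."""
    return 0.5 * math.log2(p_plus / (1.0 - p_plus))

def binary_entropy(p):
    return -p * math.log2(p) - (1.0 - p) * math.log2(1.0 - p)

p = 0.9                      # assumed symbol probability P(+1|r)
w_opt = llr(p)               # transfer value proposed by Lemma 7

print(w_opt, symbol_cross_entropy(w_opt, p), binary_entropy(p))
```

At `w_opt` the cross entropy summand coincides with the binary entropy of the symbol, and any other transfer value gives a larger summand, in line with $H(\mathbb{S}|w) = H(C|w\|r)$ for a belief representing $w$.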

By reconsidering the definition of the cross typical set in (25), the set $\mathbb{A}_{n\varepsilon}(C|r)$, which is typical in $r$ and $C$, is (in the mean) contained in the set of words $s \in \mathbb{S}$ that are probable in $w$ if
\[ H(\mathbb{S}|w) \ge H(C|w\|r). \]
Here the set of probable words is defined by only considering the right hand side inequality of (25).

2.5. Discriminator Entropy 

In this section the considerations are extended to the discrimination. To do so we use in equivalence to (28) the following definition.

Definition 12 (Discriminated Cross Entropy) The discriminated cross entropy is
\[ H(C^{\otimes}|w\|m) := -\sum_{u} P^{\otimes}(u|m) \log_2 P(s|w) = \sum_{i=1}^{n} \log_2(2^{w_i} + 2^{-w_i}) - \sum_{i=1}^{n} w_i \sum_{x \in \mathbb{B}} x\, P^{\otimes}_{C_i}(x|m) = \sum_{i=1}^{n} \log_2(2^{w_i} + 2^{-w_i}) - w_i E^{\otimes}[c_i|m] \]
with (27) and $E^{\otimes}[c_i|m] = P^{\otimes}_{C_i}(+1|m) - P^{\otimes}_{C_i}(-1|m)$.

Note that this definition again uses the correspondence of $u$ and $s$. Even though by a discrimination not all words are independently considered, a word uncertainty consideration is still possible by attributing appropriate probabilities. Lemma 7 directly gives that the discriminated cross entropy is always larger than or equal to the discriminated symbol entropy $H(C^{\otimes}\|m)$, i.e.,
\[ H(C^{\otimes}|w\|m) \ge -\sum_{u} P^{\otimes}(u|m) \log_2 P(s|L^{\otimes}(m)) =: H(C^{\otimes}\|m). \]
The discriminator entropy measures the uncertainty of the discriminated decoding decision, i.e., the number of words in $\mathbb{S}$ that need to be considered. This directly gives the following theorem.

Theorem 3 The decoding problem for a distinguished word is equivalent to the solution of
\[ w_i = \operatorname{sign}(L_i^{\otimes}(m)) \tag{29} \]
with the discriminated symbol entropy $H(C^{\otimes}\|m) < 1$ and $m = (r, w, w)$.

Proof: For $w_i = c_i^{(a)}$ it holds that $c^{(a)} \in \mathbb{D}(m)$. This gives with Lemma 6 for the discriminated distribution that
\[ P^{\otimes}(u|m) \asymp P^{(a)}(u|r). \]
As $c^{(a)}$ is a distinguished solution this gives $P^{(a)}(u(c^{(a)})|r) \approx 1$ or equivalently $H(C^{\otimes}\|m) \approx 0$.

The discriminated symbol entropy $H(C^{\otimes}\|m)$ estimates the logarithmic number of elements in the set of probable words in $\mathbb{S}$, i.e., $\exp_2 H(C^{\otimes}\|m)$ estimates their number. Any solution $m$ with
\[ H(C^{\otimes}\|m) < 1 \]
thus exhibits one word $s$ with $P^{\otimes}(u|m) \approx 1$. I.e., one has a discriminated distribution $P^{\otimes}(u|m)$ that contains just one peak of height almost one at $u$. As only one word contributes, the decisions by (29) give this word, or equivalently $u = u(w)$. Hence, $c = w$ is maximally discriminated.

This directly implies that the obtained $c$ needs to be a codeword of the coupled code: Both constituent distributions are used for the single word description $P^{\otimes}(u|m) \ne 0$. Hence, both codes contain the word $c$ that is maximally discriminated in $u$, which gives (by the definition of the dually coupled code) that this word is an overall codeword.

Assume that $c \ne c^{(a)}$ represents a non distinguished word. With Remark 16 this word needs to exhibit a large distance to $r$. Typically many words $c \in C^{(a)}$ exist at such a large distance. These words are considered in the computation of $P^{\otimes}(u|m)$. Thus $P^{\otimes}(u|m)$ is not in the form of a peak, which gives that $H(C^{\otimes}\|m) > 1$. As this is a contradiction, no other solution $w$ of (29) but $w = c^{(a)}$ may exhibit a discriminated symbol entropy $H(C^{\otimes}\|m) < 1$.

Remark 18 (Typical Decoding) The proof of the theorem indicates that any code word $c \in C^{(a)}$ with small distance to $r$ may give rise to a well defined discriminator $m$ with $H(C^{\otimes}\|m) < 1$ and $w = c$. Hence, a low entropy solution of the equation is not equivalent to ML decoding. However, if the code rate is below capacity and a long code is employed only one distinguished word exists.

Theorem 3 gives that Algorithm 2 fails in finding the distinguished word if either the stopping criterion is never fulfilled (it runs infinitely long) or the solution exhibits a large discriminated symbol entropy. To investigate these cases consider the following lemma.

Lemma 8 It holds that
\[ w_i = \operatorname{sign}(L_i^{\otimes}(m)) \quad\text{for all } i \tag{30} \]
minimises the cross entropy $H(C^{\otimes}|w\|m)$ under the constraint $w_i \in \mathbb{B}$.

Proof: The cross entropy $H(C^{\otimes}|w\|m)$ is given by
\[ H(C^{\otimes}|w\|m) = \sum_{i=1}^{n} \log_2(2^{w_i} + 2^{-w_i}) - w_i \cdot \tanh_2(L_i^{\otimes}(m)). \]
Under constant $|w_i|$ or $w_i \in \mathbb{B}$ the cross entropy is obviously minimal for
\[ \operatorname{sign}(w_i \cdot \tanh_2(L_i^{\otimes}(m))) = 1, \]
which is the statement of the lemma.
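Lemma 8 can be illustrated by an exhaustive search over all hard decision vectors. In the sketch below the logarithmic ratios in `L` are arbitrary assumed values standing in for $L_i^{\otimes}(m)$; the helper names are illustrative.

```python
import math
from itertools import product

def tanh2(x):
    return (2.0 ** x - 2.0 ** -x) / (2.0 ** x + 2.0 ** -x)

def cross_entropy(w, L):
    """sum_i log2(2^w_i + 2^-w_i) - w_i * tanh2(L_i), as in the proof of Lemma 8."""
    return sum(math.log2(2.0 ** wi + 2.0 ** -wi) - wi * tanh2(Li)
               for wi, Li in zip(w, L))

L = (1.3, -0.4, 0.05, -2.0)   # assumed stand-ins for the discriminated ratios

# Exhaustive minimisation over all w with w_i in {+1, -1}.
best = min(product((1, -1), repeat=len(L)), key=lambda w: cross_entropy(w, L))
signs = tuple(1 if Li > 0 else -1 for Li in L)
print(best, signs)
```

Since the first sum is constant for $w_i \in \mathbb{B}$, only the correlation term matters, and the search returns the sign pattern of `L`.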

The algorithm fails if the iteration does not converge. However, the lemma gives that (30) minimises in each step of the iteration the cross entropy towards $w$. This is equivalent to
\[ H(C^{\otimes}|v\|m) \ge H(C^{\otimes}|w\|m). \]
With $H(C^{\otimes}|w\|m) \ge \min_v H(C^{\otimes}|v\|m) = H(C^{\otimes}\|m)$ this cross entropy is never smaller than the overall discriminated symbol entropy. Furthermore, by the optimisation rule it holds that
\[ H(\mathbb{S}|w) \ge H(C^{\otimes}|w\|m) \ge H(C^{\otimes}\|m), \]
which gives that the typical set under the discrimination remains included. The subsequent step will therefore continue to consider this set. If the discriminated cross entropy does not further decrease one thus obtains the same $w$, which is a fixed point.

This observation is similar to the discussion above. A discriminator $m$ describes environments with words close to $r$. A minimisation of the cross entropy can be considered as an optimal description of this environment under the independence assumption (and the imposed hard decision constraint). If this knowledge is processed iteratively then these environments should be investigated better and better. The discriminated symbol entropy $H(C^{\otimes}\|m)$ will thus typically decrease. For an infinite loop this is not fulfilled, i.e., such a loop is unlikely or non typical.

Moreover, the iterative algorithm fails if a stable solution $w_i = \operatorname{sign}(L_i^{\otimes}(m))$ with $w \ne c^{(a)}$ is found. By the proof of Theorem 3 these solutions exhibit a large discriminated symbol entropy $H(C^{\otimes}\|m)$ (many words are probable) and thus small $|L_i^{\otimes}(m)|$. However, solutions with small $|L_i^{\otimes}(m)|$ seem unlikely as these values are usually already relatively large for $w = 0$ and Lemma 8 indicates that these values will become larger in each step.

Remark 19 (Improvements) If the algorithm fails due to a well defined discriminator of large cross entropy then an appropriately chosen increase of the discriminator complexity should improve the algorithm. To increase the discrimination complexity under hard decisions one may use discriminators with $w^{(1)} \ne w^{(2)}$. One possibility is to reuse the old transfer vector by
\[ w^{(2)} = w^{(1)} \quad\text{and}\quad w_i^{(1)} = \operatorname{sign}(L_i^{\otimes}(m)) \]
in Step 2 of the iterative algorithm. The complexity of the algorithm will then, however, increase to $O(n^4)$.

On the other hand the complexity can be strongly decreased without losing the possibility to maximally discriminate the distinguished word. First, only (distinguished) words up to some distance $t$ from the received word contribute to $L^{\otimes}(m)$. One may thus decrease (if full discrimination of the distance to $w$ is used) the complexity to $|\mathbb{U}| \le t \cdot (n + 1)$ if only those values are computed.

A further reduction is obtained by the use of erasures in $w$, i.e., by $w_i \in \{-1, 0, +1\}$ and
\[ w_i = \frac{\operatorname{sign}(L_i^{\otimes}(m)) - r_i}{2} \]
in Step 2 of the algorithm. Note that this discrimination has only complexity $|\mathbb{U}| \le t^2$ as $u_1(c) \le w w^T \le t$ is typically fulfilled.

It remains to show that the distinguished solution is stable. We do this here with an informal proof: For $c_i = \operatorname{sign}(L_i^{\otimes}(m))$ one obtains that $w_i = c_i$ if $c_i \ne r_i$ and $w_i = 0$ else. Hence one obtains that $u_0(c) = r c^T$ and $u_1(c) = w w^T$. If only one word $s$ exists for these values $u_l$ then this discriminator $(r, w, w)$ is surely maximally discriminating. First, if $u_1(c) = w w^T$ then $s_i = w_i$ is uniquely defined for $w_i \ne 0$. Under $u_0(c)$ the other symbols are then uniquely defined to be $s_i = r_i$, as $u_0(c) = r c^T$ is the unique maximum of $u_0(s)$ under $s_i = w_i$ for $w_i \ne 0$.

The overall complexity of this algorithm is thus smaller than $O(n \cdot t^2)$, respectively $O(n \cdot t^3)$ for a discrimination with $w^{(1)} \ne w^{(2)}$.

3. Approximations 

The last section indicates that an iterative algorithm with discriminated symbol probabilities should outperform the iterative propagation of only the constituent beliefs. However, the discriminator approach was restricted to problems with small discriminator complexity $|\mathbb{U}|$.

In this form, and by Remark 12, the algorithm does not apply, for example, to AWGN channels. In this section discriminator based decoding is generalised to real valued $w^{(l)}$ and $r$, and hence generally $|\mathbb{U}| = |\mathbb{S}|$.

For a prohibitively large discriminator complexity $|\mathbb{U}|$ the distributions $P^{(l)}_{C_i}(x, u|m^{(l)})$ cannot be practically computed; only an approximation is feasible. This approximation is usually done via a probability density $p^{(l)}_{C_i}(x, u|m^{(l)})$ that is described by a small number of parameters.

Remark 20 (Representation and Approximation) The use of an approximation changes the premise compared to the last section. There we assumed that the representation complexity of the discriminator is limited but that the computation is perfect. In this section we assume that the discriminator is generally globally maximal but that an approximation is sufficient.

An estimation of a distribution may be performed by a histogram given by the rule
\[ \mathbb{U}_{\varepsilon}(u|m) = \bigcup_{v} \mathbb{U}(v|m), \]
where the union is taken over the values $v$ that fall into the same quantisation interval as $u$, and the quantisation $\varepsilon$. These values can be approximated (see Appendix A.1) with an algorithm whose complexity is comparable to the one for the computation of the hard decision values. For a sufficiently small $\varepsilon$ one obviously obtains a sufficient approximation. Here the complexity remains of the order $O(n^3)$. It may, however, be reduced as in Remark 19.

Remark 21 (Uncertainty and Distance) The approach with histograms is equivalent to assuming that words with similar $u$ do not need to be distinguished; a discrimination of $s^{(1)}$ and $s^{(2)}$ is assumed to be unnecessary if the "uncertainty distance" of $s^{(1)}$ and $s^{(2)}$, a sum over the components $l = 0, 1, 2$ of $u$, is smaller than some $\varepsilon$. The error that occurs by
\[ P^{\otimes}_{C_i}(x|m) \propto \int_{u} \frac{p^{(1)}_{C_i}(x, u|m^{(1)})\, p^{(2)}_{C_i}(x, u|m^{(2)})}{p_{C_i}(x, u|m)}\, du \]
can for sufficiently small $\varepsilon$ usually be neglected.

Note that another approach is to approximate only in $u_0$ and continue to use a limited discriminator complexity in $w$ (by, for example, hard decisions $w_i \in \mathbb{B}$), which gives an exact discrimination in $u_1$.
3.1. Gauss Discriminators 

Distributions are usually represented via parameters defined by expectations. This is done as the law of 
large numbers shows that these expectations can be computed out of a statistics. Given these values then 
the unknown distributions may be approximated by maximum entropy [9] densities.

Example 6 The simplest method to approximate distributions by probability densities is to assume that no extra knowledge is available over $u$. This leads to the maximal entropy "distributions" (in Bayes' estimation theory this is equivalent to a non proper prior) with stripped $u$
\[ P^{(l)}_{C_i}(x|m^{(l)}) \approx P^{(l)}_{C_i}(x, u|m^{(l)}) \quad\text{and}\quad P_{C_i}(x|m) \approx P_{C_i}(x, u|m), \]
which is, as then $P_{C_i}(u|x, m) = 1$, equivalent to
\[ L_i^{\otimes}(m) = -\sum_{l=0}^{2} w_i^{(l)} + L_i^{(1)}(m^{(1)}) + L_i^{(2)}(m^{(2)}) \tag{31} \]
(with $w^{(0)} = r$) and thus implicitly to Algorithm 1. Note that the derived tools do not give rise to a further evaluation of this approach: A discrimination in the sense defined above does not take place.

The additional expectations considered here are the mean values $\mu_l$ and the correlations $\phi_{l,k}$. These are for the given correlation map
\[ u_l(s) = \sum_{i=1}^{n} w_i^{(l)} s_i \tag{32} \]
and $|s_i| = 1$ defined by
\[ \mu_l = E[u_l(s)] = \sum_{i=1}^{n} w_i^{(l)} E[s_i] \quad\text{and}\quad \phi_{l,k} = E[u_l(s)\, u_k(s)] = \sum_{i=1}^{n} \sum_{j=1}^{n} w_i^{(l)} w_j^{(k)} E[s_i s_j], \]
with the expectations taken under the respective distribution. The complexity of the computation of each value $\mu_l^{(h)}$ and $\phi_{l,k}^{(h)}$ is here (see Appendix A.1) comparable to the complexity of the BCJR Algorithm, i.e., $O(n)$ for fixed trellis state complexity.

For known mean values and variances the maximum entropy density is the Gauss density. With the following lemma this density is especially suited for discriminator based decoding.

Lemma 9 For long codes with small trellis complexity one obtains asymptotically a Gauss density for $P^{(l)}(u|m^{(l)})$ and $P(u|m)$.

Proof: The values $u_l(c)$ are obtained by the correlation given in (32). For $P(u|m)$ this is equivalent to a sum of independent random values. I.e., $P(u|m)$ is by the central limit theorem Gauss distributed. For long codes with small trellis state complexity and many considered words the same holds true for $P^{(h)}(u|m^{(h)})$. In this case the limited code memory gives sufficiently many independent regions of subsequent code symbols. I.e., the correlation again leads to a sum of many independent random values.
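For the case of independent symbols, as in $P(u|m)$, the two parameters of the Gauss density follow in closed form from the symbol expectations. The sketch below computes mean and variance of a single correlation value $u = \sum_i w_i s_i$ and compares them with an exhaustive enumeration; the weights `w` and expectations `t` are assumed example values, and the constituent code case (which needs the trellis computations of Appendix A.1) is not covered.

```python
import math
from itertools import product

def gauss_parameters(w, t):
    """Mean and variance of u = sum_i w_i s_i for independent s_i in {+1,-1}
    with E[s_i] = t_i (so that Var[s_i] = 1 - t_i^2)."""
    mean = sum(wi * ti for wi, ti in zip(w, t))
    var = sum(wi * wi * (1.0 - ti * ti) for wi, ti in zip(w, t))
    return mean, var

def exact_moments(w, t):
    """Same quantities by enumerating all 2^n words (only feasible for small n)."""
    mean = second = 0.0
    for s in product((1, -1), repeat=len(w)):
        p = math.prod((1.0 + si * ti) / 2.0 for si, ti in zip(s, t))
        u = sum(wi * si for wi, si in zip(w, s))
        mean += p * u
        second += p * u * u
    return mean, second - mean * mean

w = (0.5, -1.2, 2.0, 0.3)    # assumed weights w_i
t = (0.8, -0.1, 0.6, 0.0)    # assumed symbol expectations E[s_i]
print(gauss_parameters(w, t), exact_moments(w, t))
```

Both computations agree; the closed form needs $O(n)$ operations, which matches the order of complexity stated for the moment computation.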

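Remark 22 (Notation) The Gauss approximated symbol probability distributions are here denoted by a hat, i.e., $\hat{p}^{(l)}_{C_i}(x, u|m^{(l)})$ and $\hat{p}_{C_i}(x, u|m)$. The same is done for the approximated logarithmic symbol probability ratios.

The constituent Gauss approximations then imply the approximation of $P^{\otimes}(u|m)$ by
\[ \hat{p}^{\otimes}(u|m) \propto \frac{\hat{p}^{(1)}(u|m^{(1)})\, \hat{p}^{(2)}(u|m^{(2)})}{\hat{p}(u|m)} \]
and thus approximated discriminated symbol probabilities (for the computation see Appendix A.2)
\[ \hat{P}^{\otimes}_{C_i}(x|m) \propto \int_{u} \frac{\hat{p}^{(1)}_{C_i}(x, u|m^{(1)})\, \hat{p}^{(2)}_{C_i}(x, u|m^{(2)})}{\hat{p}_{C_i}(x, u|m)}\, du. \tag{33} \]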

This approximation is obtained via other approximations. Its quality can thus not be guaranteed as before. To use the approximated discriminated symbol probabilities in an iteration one therefore first has to check the validity of (33).

By some choice of $w^{(1)}$ and $w^{(2)}$ the approximations of the constituent distributions are performed in an environment $\mathbb{A}_{n\varepsilon}(C^{(l)}|m^{(l)})$ where $\hat{p}^{(l)}(u|m^{(l)})$ is large. The overall considered region is given by $\mathbb{A}_{n\varepsilon}(\mathbb{S}|m)$ defined by $\hat{p}(u|m)$. This overall region represents the possible overall words. The approximation is surely valid if the possible code words of the $l$-th constituent code under $m^{(l)}$ are included in this region. I.e., the conditions
\[ \mathbb{A}_{n\varepsilon}(C^{(l)}|m^{(l)}) \subseteq \mathbb{A}_{n\varepsilon}(\mathbb{S}|m) \tag{34} \]
for $l = 1, 2$ have to be fulfilled. In this case the description of the last section applies as then the approximation is typically good.

Remark 23 That this consideration is necessary becomes clear under the assumption that the constituent Gauss approximations do not consider the same environments. In this case their mean values strongly differ. The approximation of the discriminated distribution will then consider regions with a large distance to the mean. The obtained results are then not predictable as a Gauss approximation is only good for the words assumed to be probable, i.e., close to its mean value. Under (34) this cannot happen.

The condition (34) is, with respect to the set sizes, fulfilled if
\[ H(\mathbb{S}|m) \ge H(C^{(l)}|m\|m^{(l)}) \]
as this gives with (26) that
\[ \mathbb{A}_{n\varepsilon}(\mathbb{S}|m) \supseteq (\mathbb{A}_{n\varepsilon}(C^{(l)}|m\|m^{(l)}) \cap C^{(l)}) \supseteq \mathbb{A}_{n\varepsilon}(C^{(l)}|m^{(l)}). \]
However, by (34) not only the set sizes but also the words need to match. With
\[ H(\mathbb{S}|m) = \sum_{i=1}^{n} H(C_i | r_i + w_i^{(1)} + w_i^{(2)}) \]
we therefore employ the symbol wise conditions
\[ H(C_i | r_i + w_i^{(1)} + w_i^{(2)}) \ge H(C_i^{(l)} | r_i + w_i^{(1)} + w_i^{(2)} \| m^{(l)}). \tag{35} \]
As all symbols are independently considered the conditions (34) are then typically fulfilled.

A decoding decision is again found if $H(C^{\otimes}\|m) < 1$ under (35). To find such a solution we propose to minimise in each step $H(C^{\otimes}|v\|m)$ under the condition (35) of code $l$.

As (35) is then fulfilled, the obtained set of probable words remains in the region of common beliefs, which guarantees the validity of the subsequent approximation. This optimised $v$ is then used to update $w^{(l)}$ under fixed $w^{(h)}$ and $h \ne l$.

This gives Algorithm 3. Consider first the constrained optimisation in Step 3 of Algorithm 3. The definition of the cross entropy
\[ H(C_i^{(l)}|v_i\|m^{(l)}) = \log_2(2^{v_i} + 2^{-v_i}) - v_i \tanh_2(L_i^{(l)}(m^{(l)})) \]
transforms the constraint to $v_i \tanh_2(v_i) \le v_i \tanh_2(L_i^{(l)}(m^{(l)}))$, which is equivalent to
\[ |v_i| \le |L_i^{(l)}(m^{(l)})| \quad\text{and}\quad \operatorname{sign}(v_i) = \operatorname{sign}(L_i^{(l)}(m^{(l)})). \]
Moreover, the optimisation $H(C^{\otimes}|v\|m) \to \min$ without constraint gives $v_i = L_i^{\otimes}(m)$.

Algorithm 3 Iteration with Approximated Discrimination

1. Set $w^{(1)} = w^{(2)} = 0$. Set $l = 2$ and $h = 1$.

2. Swap $l$ and $h$. Set $z = w^{(l)}$.

3. Set $v$ such that
\[ H(C^{\otimes}|v\|m) \to \min \]
under $H(C_i|v_i) \ge H(C_i^{(l)}|v_i\|m^{(l)})$ for all $i$.

4. Set $w^{(l)} = v - w^{(h)} - r$.

5. If $w^{(l)} \ne z$ then go to Step 2.

6. Set $\hat{c}_i = \operatorname{sign}(v_i)$ for all $i$.



This consideration directly gives the following cases:

• If this $v_i$ does not violate the constraint then it is already optimal.

• It violates the constraint if
\[ \operatorname{sign}(L_i^{(l)}(m^{(l)})) \ne \operatorname{sign}(L_i^{\otimes}(m)). \]
In this case one has to set $v_i = 0$ to fulfil the constraint.

• For the remaining case that $\operatorname{sign}(L_i^{(l)}(m^{(l)})) = \operatorname{sign}(L_i^{\otimes}(m))$ but that the constraint is violated by
\[ |L_i^{(l)}(m^{(l)})| < |L_i^{\otimes}(m)| \]
the optimal solution is
\[ v_i = L_i^{(l)}(m^{(l)}) \]
as the cross entropy $H(C_i^{\otimes}|v_i\|m)$ is a strictly monotonic function between $v_i = 0$ and $v_i = L_i^{\otimes}(m)$.

The obtained $v_i$ are thus given by either $L_i^{\otimes}(m)$, $L_i^{(l)}(m^{(l)})$, or zero. The zero value is obtained if the two estimated symbol decisions mutually contradict each other, which is a rather intuitive result. Moreover, note that the constrained optimisation is symmetric, i.e., it is equivalent to
\[ H(C_i^{(l)}|v_i\|m^{(l)}) \to \min \quad\text{under}\quad H(C_i|v_i) \ge H(C_i^{\otimes}|v_i\|m) \quad\text{for all } i. \tag{36} \]
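The three cases of the constrained optimisation reduce to a simple rule per symbol. The following sketch implements this rule for given values of $L_i^{\otimes}(m)$ and $L_i^{(l)}(m^{(l)})$; the function name, argument names, and the example numbers are illustrative, and a vanishing ratio is treated like a sign contradiction, which is an assumption of the sketch.

```python
def constrained_update(L_disc, L_const):
    """Per-symbol solution v_i of Step 3 of Algorithm 3.

    L_disc  : discriminated ratio (unconstrained optimum).
    L_const : constituent ratio that defines the constraint
              |v_i| <= |L_const| and sign(v_i) = sign(L_const).
    """
    if L_disc * L_const <= 0:          # contradicting (or vanishing) decisions
        return 0.0
    if abs(L_const) < abs(L_disc):     # same sign, but magnitude too large
        return L_const
    return L_disc                      # unconstrained optimum is admissible

L_disc = [2.0, 0.5, 1.0, -2.0]         # assumed example values
L_const = [0.5, 2.0, -3.0, -0.5]
v = [constrained_update(a, b) for a, b in zip(L_disc, L_const)]
print(v)
```

The result is always the value of smaller magnitude when the signs agree and zero otherwise, which also makes the symmetry stated in (36) visible.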

Remark 24 (Higher Order Moments) By the central limit theorem higher order moments do not significantly improve the approximation. This statement is surprising as the knowledge of all moments leads to perfect knowledge of the distribution and thus to globally maximal discrimination. However, the statement just indicates that one would need a large number of higher order moments to obtain additional useful information about the distributions.

3.1.1. Convergence 

At the beginning of the algorithm many words are considered and a GAUSS approximation surely suffices, i.e., in this case an approximation by histograms would not produce significantly different results. The convergence properties should thus at the beginning be comparable to those of an algorithm that uses a discrimination via histograms. There, however, a sufficiently small $\varepsilon$ should give good convergence properties.

At the end of the algorithm typically only few words remain to be considered. For this case the Gauss approximation is surely outperformed by the use of histograms. Note that this observation does not contradict the statement of Lemma 9 as we there implicitly assumed "enough" entropy. Intuitively, however, this case is simpler to solve, which implies that the Gauss approximation should remain sufficient.

This becomes clear by reconsidering the region $A_{n\varepsilon}(S|m)$ that is employed in each step of the algorithm. An algorithm that uses histograms will outperform an algorithm with a Gauss approximation if different independent regions in $A_{n\varepsilon}(S|m)$ become probable. A Gauss approximation expects a connected region and will thus span over these regions, i.e., the error of the approximation will lead to a larger number of words that need to be considered. However, this should not have a significant impact on the convergence properties.

Typically the number of words to be considered will thus become smaller in every step: the iterative algorithm gives that in every step the discriminated cross entropy (see Definition 12)

i?(C®|m(")||m(°)) j5®(M|m(°))log2P(s|m("))dM 

u 

is smaller than the discriminated symbol entropy $H(C^{\otimes}\|m^{(o)})$ under the assumed prior discrimination $m^{(o)}$. Hence the algorithm should converge to some fixed point.



3.1.2. Fixed Points 

The considerations above give that the algorithm will typically not stay in an infinite loop and thus end at a fixed point. Moreover, at this fixed point the additional constraints will be fulfilled. It remains to consider whether the additional constraints introduce solutions of large discriminated symbol entropy $H(C^{\otimes}\|m)$.

Intuitively the additionally imposed constraints seem not less restrictive than the use of histograms and $w^{(1)} = w^{(2)}$, as $w^{(1)} \ne w^{(2)}$ implies a better discrimination. However, solutions with large discriminated symbol entropy $H(C^{\otimes}\|m)$ will even for the second case typically not exist. Moreover, the discrimination uses continuous values, which should be better than the again sufficient hard decision discrimination, i.e., the constraint should have only a small (negative) impact on the intermediate steps of the algorithm.

Usually $H(C^{\otimes}\|m)$ is already relatively small at the start ($w^{(1)} = w^{(2)} = 0$). The subsequent step will despite the constraint typically exhibit a smaller discriminated symbol entropy. This is equivalent to a smaller error probability and hence typically a better discrimination of the distinguished word.

If the process stalls for

$$H(C^{\otimes}\|m) > 1$$

then the region investigated by $m$ exhibits either no or multiple typical words. As (typically) the distinguished word is the only code word in the typical set and as the typical set is (typically) included, this typically does not occur.

Finally, for $H(C^{\otimes}\|m) \approx 0$ the distinguished solution is found. At the end of the algorithm (and under the assumption of a distinguished solution) the obtained Gauss discriminated distribution then mimics a Gauss approximation of the overall distribution, i.e.,

$$p^{\otimes}(u|m) \propto \frac{p^{(1)}(u|m^{(1)})\,p^{(2)}(u|m^{(2)})}{p(u|m)} \approx p^{(a)}(u|r). \tag{37}$$

Without the constraints many solutions $m$ exist. It only has to be guaranteed that the constituent sets intersect at the distinguished word. By the addition of the constraints the solution becomes unique and is defined such that the number of words considered by $p(u|m)$ is as small as possible.

This generally implies that both constituent approximated distributions then need to be rather similar. This is the desired behaviour as the considered environment is defined by a narrow peak of $p^{(a)}(u|r)$ around $u(c)$. Hence, the additional constraints seem needed for a defined fixed point and the limitations of the Gauss approximation. This emphasises the statement above: without the constraint non predictable behaviour may occur.

Remark 25 (Optimality) The values $w_i^{(l)}$ are continuous. Thus, one can search for the optimum of

$$H(C^{\otimes}\|m) \to \min$$

by a differentiation of $H(C^{\otimes}\|m)$. For the differentiation holds

2 ^——j^ — = tanh2 L,{m) - tanha (m) - 2 / ^ ^ ' ^ log2 P{s\m)du. 

For the first term see Lemma 7 and the definition of the discriminated symbol probabilities. The second term is the derivative of the discriminated probability density. For the case of a maximal discrimination of the distinguished word it will consist of this word with probability of almost one. A differential variation of the discriminator should remain maximally discriminating, which gives that the second term should be almost zero. Hence, one obtains that

$$L_i(m) \approx L_i^{\otimes}(m) \tag{38}$$

holds at the absolute minimum of $H(C^{\otimes}\|m)$, which is a (soft decision) well defined discriminator. Note, furthermore, that for (37) and similar constituent distributions the distribution $p(u|m)$ will necessarily be similar to $p^{\otimes}(u|m)$, which is a similar statement as in (38).

Remark 26 (Complexity) Under the assumption of fast convergence the decoding complexity is of the order $O(n)$, i.e., the complexity only depends on the BCJR decoding complexity of the constituent codes. Moreover, Algorithm 3 can still be considered as an algorithm where parameters are transferred between the codes. Hereby the number of parameters is increased by a factor of nineteen (for each $i$, additionally to $w_i^{(l)}$, three means and six correlations for each of $x = \pm 1$).

Note, finally, that the original iterative (constituent) belief propagation algorithm is rather close to the proposed algorithm. Only by (36) an additional constraint is introduced. Without the constraint apparently too strong beliefs are transmitted. Algorithm 3 cuts off excess constituent code belief.



3.2. Multiple Coupling 

Dually coupled codes constructed by just two constituent codes (with simple trellises) are not necessarily good codes. This can be understood by the necessity of simple constituent trellises. This gives that the left-right (minimal row span) [11] forms of the (permuted) parity check matrices have short effective lengths, so that the codes cannot be considered as purely random as this condition strongly limits the choice of codes. However, to obtain asymptotically good codes one generally needs that the codes can be considered as random.

If - as in Remark 3 - more constituent codes are considered, then the dual codes will have smaller rate and thus a larger effective length. This is best understood in the limit, i.e., the case of $n - k$ constituent codes with $n - k^{(l)} = 1$. These codes can then be freely chosen without changing the complexity, which leaves no restriction on the choice of the overall code.

For a setup of a dual coupling with $N$ codes the discriminated distribution of correlations is generalised to

$$P^{\otimes}(u|m) \propto \frac{\prod_{l=1}^{N} P^{(l)}(u|m^{(l)})}{(P(u|m))^{N-1}}$$

with

$$m = (r, w^{(1)}, \ldots, w^{(N)}), \qquad m^{(l)} = (r, w^{(1)}, \ldots, w^{(l-1)}, w^{(l+1)}, \ldots, w^{(N)}),$$
and an independence assumption as in ( fTTT l and (|5]l. 
The definition of the discriminated symbol probabilities then becomes



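$$P^{\otimes}_{C_i}(x|m) \propto \sum_u \frac{\prod_{l=1}^{N} P^{(l)}_{C_i}(x,u|m^{(l)})}{(P_{C_i}(x,u|m))^{N-1}}.$$

Moreover, for globally maximal discriminators

$$P^{\otimes}_{C_i}(x|m) = P^{(a)}_{C_i}(x|r) \quad \text{and} \quad P^{\otimes}(u|m) = P^{(a)}(u|r)$$

remains true. The other lemmas and theorems above can be likewise generalised. Hence, discriminator decoding by GAUSS approximations applies to multiple dually coupled codes, too.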

Remark 27 (Iterative Algorithm) The generalisation of Algorithm 3 may be done by using

V, = argminiJ(Cf |w,|lm) under i?(C>,) > H {C^^^ \v^\\m^^^) for alH 

V 

N 

as constituent code dependent update. 



Overall this gives - provided the distinguished well defined solution is found - that discriminator decoding 
asymptotically performs as typical decoding for a random code. I.e., with dually coupled codes and (to the 
distinguished solution convergent) Gauss approximated discriminator decoding the capacity is attained. 

Remark 28 (Complexity) The complexity of decoding is of the order of the sum of the constituent trellis 
complexities and thus generally increases with the number of codes employed. For a fixed number of 
constituent codes of fixed trellis state complexity and GAUSS approximated discriminators the complexity
thus remains of the order $O(n)$.

Remark 29 {Number of Solutions) For a coupling with many constituent codes one obtains a large number 
of non linear optimisations that have to be performed simultaneously. The non linearity of the common 
problem should thus increase with the number of codes. Another explanation is that then many times 
typicality is assumed. The probability of some non typical event then increases. This may increase the 
number of stable solutions of the algorithm or introduce instability. 

This behaviour may be mitigated by the use of punctured codes. The punctured positions define beliefs, 
too, which gives that the transfer vector w is generally longer than n. The transfer complexity is thus 
increased, which should lead to better performance. Note that this approach is implicitly used for LDPC 
codes. 



3.3. Channel Maps 

In the last sections only memory-less channel maps as given in Remark 4 were considered. A general channel is given by a stochastic map

$$\mathcal{K}: S \to R \quad \text{defined by} \quad P_{R|S}(r|s).$$

We will here only consider channels where signal and "noise" are independent. In particular we assume that the channel $\mathcal{K}$ is given by some known deterministic map

$$\mathcal{H}: s \mapsto v = (v_1, \ldots, v_n)$$

and $r = v + e$ with the additive noise $E$ defined by $P_E(e)$.

A code map $C$ prior to the transmission together with the map $\mathcal{H}$ may then be considered as a concatenated map. The concatenation is hereby (for the formal representation by dually coupled codes see the proof of Theorem 1) equally represented by the dual coupling of the "codes"

$$C^{(1)} := \{c^{(1)} = (c, z) : c \in C\} \quad \text{and} \quad C^{(2)} := \{c^{(2)} = (s, v) : s \in S \text{ and } \mathcal{H}: s \mapsto v\}$$

where $z = (z_1, \ldots, z_n)$ is undefined, i.e., no restriction is imposed on $z$. Moreover, $c$ is punctured prior to transmission and only $v + e$ is received. Discriminator based decoding thus applies and one obtains

Pc.(x|m) cx 2^ Pp.(x,M|ti,(i),«;(2)) 
as by the definition of the dually coupled code 

P^^^{x,u\m'-^^) ^ P^^^{x,u\w^^'^)P{u\r) 

and by the independence assumption Pc- (x, tt|m) = Pci{x,u\w'-^\w^^'>)P{u\r) are independent of the 
channel. 

Remark 30 (Trellis) If a trellis algorithm exists to compute P^^, {x\r) then one may compute the symbol 

probabilities P^\x\r,w''^'>), the mean values and variances of u under P^\x,u\r,w''^^'>) with similar 
complexity. 

Example 7 A linear time invariant channel with additive white GAUSSian noise $\underline{e}(t)$ is given by the map

$$\underline{r}(t) = \int_{-\infty}^{\infty} \underline{s}(t-\tau)\,\underline{h}(\tau)\,d\tau + \underline{e}(t).$$

Here, we assume a description in the equivalent base band, i.e., the signals $\underline{r}(t)$ and $\underline{s}(t)$ as well as the noise may be complex valued, indicated by the underbar. The noise is assumed to be white and thus exhibits the (stationary) correlation function $E[\underline{e}(t)\,\underline{e}^*(t+\tau)] = \sigma_E^2\,\delta(\tau)$.

For amplitude shift keying modulation one employs the signal

$$\underline{s}(t) = \sum_{i=-\infty}^{\infty} \underline{s}_i\,w(t - iT) \quad \text{with } w(\tau) \text{ being the waveformer.}$$

With a matched filter and well chosen whitening filter one obtains an equivalent (generally complex valued) discrete channel

$$\mathcal{Q}: s \mapsto r \quad \text{with} \quad r_i = \sum_{j=0}^{M} s_{i-j}\,q_j + e_i \tag{39}$$

defined by $q = (q_0, \ldots, q_M)$ and independent Gauss noise $E[e_i e_j^*] = \sigma_E^2\,\delta_{i-j}$.

For binary phase shift keying one has $s_i = A x_i$ and $x_i \in \mathbb{B}$. For quaternary phase shift keying the map is given by

$$s_i = A\,(x_{2i} + \mathrm{j}\,x_{2i+1}),$$

$\mathrm{j}^2 = -1$, and $x_i \in \mathbb{B}$. In both cases a trellis for $S$ may be constructed with logarithmic complexity proportional to the memory $M$ of the channel $q = (q_0, q_1, \ldots, q_M)$ times the number of information Bits per channel symbol $s_i$. Note, moreover, that a time variance of the channel does not change the trellis complexity.
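
The discrete channel (39) can be illustrated by a short simulation. The sketch below makes simplifying assumptions that are not part of the example: all quantities are real valued (base band transmission), symbols before index 0 are taken as zero, and the names `isi_channel` and `bpsk` are hypothetical.

```python
import random

def bpsk(bits_pm1, amplitude=1.0):
    """Binary phase shift keying: s_i = A * x_i with x_i in {-1, +1}."""
    return [amplitude * x for x in bits_pm1]

def isi_channel(symbols, taps, sigma, rng):
    """r_i = sum_j s_{i-j} q_j + e_i with white Gauss noise of standard deviation sigma.

    Symbols with negative index are assumed to be zero (an assumption of this sketch).
    """
    received = []
    for i in range(len(symbols)):
        acc = 0.0
        for j, q in enumerate(taps):
            if i - j >= 0:
                acc += symbols[i - j] * q
        received.append(acc + rng.gauss(0.0, sigma))
    return received
```

For a noise free channel with `taps = [1.0, 0.5]` every received value is the current symbol plus half of the previous one, which makes the memory $M = 1$ directly visible.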
3.4. Channel Detached Discrimination



Overall one obtains for a linear modulation and linear channels with additive noise the discrete probabilistic channel map

$$\mathcal{K}: s \mapsto r = sQ + e. \tag{40}$$

For uncorrelated Gauss noise $E$ this gives the probabilities (without prior knowledge about the code words)

$$P(c|r) \propto \exp_2\!\left(-\frac{\log_2(e)}{2\sigma_E^2}\,\|r - cQ\|^2\right).$$

If the channel has large memory $M$ and/or if a modulation scheme with many Bits per symbol $s_i$ is used then the trellis complexity of a trellis equalisation becomes prohibitively large. To use the channel map as a constituent code will then not give a practical algorithm.
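
For very small word sets the probabilities $P(c|r)$ can be evaluated directly, which may help to check an implementation. The following brute force sketch assumes a real valued channel with zero initial state and the noise variance convention $2\sigma^2$ in the exponent; the function names are hypothetical, and the loop over all words is exactly the exponential effort that the trellis and discriminator approaches avoid.

```python
import math

def channel_output(c, taps):
    """Noise free output cQ of a real channel with taps q (zero initial state assumed)."""
    return [sum(c[i - j] * q for j, q in enumerate(taps) if i - j >= 0)
            for i in range(len(c))]

def word_posteriors(r, codewords, taps, sigma):
    """P(c|r) proportional to exp(-||r - cQ||^2 / (2 sigma^2)), normalised over `codewords`."""
    metrics = []
    for c in codewords:
        v = channel_output(c, taps)
        dist2 = sum((ri - vi) ** 2 for ri, vi in zip(r, v))
        metrics.append(-dist2 / (2.0 * sigma ** 2))
    top = max(metrics)                      # subtract the maximum for numerical stability
    weights = [math.exp(m - top) for m in metrics]
    total = sum(weights)
    return [w / total for w in weights]
```

Note that $\exp_2(-\log_2(e)\,x) = e^{-x}$, so the natural exponential used here agrees with the base two notation of the text.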

Reconsider therefore the computation of the discriminated symbol probabilities $P^{\otimes}_{C_i}(x|m)$ under the assumption that the employed code is already a dually coupled code with, to again simplify the notation, only two constituent codes.

To apply the discriminator based approach one thus needs to compute 



Obviously, the complexity of the computation of the symbol probabilities P^'^ {x, u\m^''^) of the constituent 
codes under the channel maps is generally prohibitively large. However, one may equivalently (see ( fTTI i on 
Page[T2]) compute 



_ ^ 



P^}\x,u\w(^'>)P^^\x,u\w^^^) 



where $u_0$ represents the channel probabilities. An alternative method to compute $P^{\otimes}_{C_i}(x|m)$ is thus to first compute

$$P^{\otimes}_{C_i}(x,u|w^{(1)},w^{(2)}) \propto \frac{P^{(1)}_{C_i}(x,u|w^{(2)})\,P^{(2)}_{C_i}(x,u|w^{(1)})}{P_{C_i}(x,u|w^{(1)},w^{(2)})}$$

by the constituent distributions $P^{(l)}_{C_i}(x,u|w^{(h)})$ for $h \ne l$ and $P_{C_i}(x,u|w^{(1)},w^{(2)})$, to then sum the distributions $P^{\otimes}_{C_i}(x,u|w^{(1)},w^{(2)})$ multiplied by $\exp_2(u_0)$. In the distributions $P^{\otimes}_{C_i}(x,u|w^{(1)},w^{(2)})$ the variable $u_0 = \log_2(P(c|r))$ thereby relates to the channel probabilities. The discrimination itself is detached from the channel information, i.e., done only by the $w^{(l)}$.

This approach gives for linear channels and a Gauss approximation a surprisingly small complexity. This is the case as for linear channel maps the computation of the "channel moments", i.e., the moments depending on $u_0$, is not considerably more difficult than the computation of the code moments above. To illustrate consider the channel dependent means, i.e., the expectation $E^{(l)}_{C_i}[u_0|x, w^{(h)}]$ where one obtains

E^il[uo\x,w(^^]:^ ^ MoP(c|ti;W)= ^ log2(P(c|r)) P(c|t,;W) 

= E«[log2(P(c|r))|:E, = const + i^|#Eg [\\r - cQ\\'\x, u;^]. (42) 

This is similar to the computation of the variances on Page 23. Generally holds that the means and correlations can be computed for linear channels with complexity that increases only linearly with the channel memory $M$. This result follows as the expectations for the channels remain computations of moments, but now with vector operations. The computation of the variance of $u_0$ (for a channel with memory) is, e.g., equivalent to the computation of a fourth moment in the independent case.

Generally holds that the mean values $E^{(l)}_{C_i}[u_0|x, w^{(h)}]$ are only computable up to a constant. This is under a Gauss assumption and (41) equivalent to a shift of $u_0$ in $\exp_2(u_0)$ by this constant. However, this will lead to a proportional factor, which vanishes in the computation of $P^{\otimes}_{C_i}(x|m)$. This unknown constant may thus be disregarded.

Remark 31 (Constituent Code) This approach applies by

$$P^{(h)}_{C_i}(x|m^{(h)}) \propto \sum_u P^{(h)}_{C_i}(x,u|w^{(l)})\,\exp_2(u_0) \quad \text{for } l \ne h$$

to the constituent codes $C^{(l)}$, too: one may likewise compute the constituent beliefs via the moments and a Gauss approximation and thus apply Algorithm 3.

The Gauss approximation for uo surely holds true if the channel is short compared to the overall length 
as then many independent parts contribute. With dTTT l one can thus apply the iterative decoder based on 
Gauss approximated discriminators for linear channels with memory without much extra complexity. 

Remark 32 (Matched Filter) Note that one obtains by ( |42] | for the initialisation w^^^ = 0, Z = 1, 2 that 
Lf (m) is proportional to the "matched filter output" given by qir^ . Moreover, in all steps of the algorithm 
only [m] is directly affected by the channel map. 



3.5. Estimation 



In many cases the transmission channel is unknown at the receiver. This problem is usually mitigated by a channel estimation prior to the decoding. However, an independent estimation needs - especially for time varying channels [7] - considerable excess redundancy. The optimal approach would be to perform decoding, estimation, and equalisation simultaneously.

Example 8 Assume that it is known that the channel is given as in (39), but that the channel parameters $q = (q_0, \ldots, q_L)$ are unknown. Moreover, assume that the transmission is in the base band, which gives that the $q_i$ are real valued. The aim is to determine these values together with the code symbol decisions. To consider them in the same way, i.e., by decisions, one needs to reduce the (infinite) description entropy. We therefore assume a quantisation of $q$ by a binary vector $b$. This may, e.g., be done by

$$q_i = q \sum_{j=0}^{B_i-1} b_{l(i)+j}\,\exp_2(j), \quad l(i) = l(i-1) + B_{i-1}, \quad l(0) = 0, \quad \text{and } b_j \in \mathbb{B}.$$

Note that one uses the additional knowledge $|q_i| < q\,\exp_2(B_i)$ under this quantisation. Moreover, the quantisation error tends to zero with the quantisation step size $q$. Finally, surely a better quantisation can be found via rate distortion theory.
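
The quantisation of the example may be sketched as follows. The sketch assumes antipodal Bits $b_j \in \{-1, +1\}$ and consecutive Bit blocks per tap; the function name `taps_from_bits` and its arguments are hypothetical and only serve the illustration.

```python
def taps_from_bits(bits, widths, step):
    """Map a binary vector to channel taps: q_i = step * sum_{j < B_i} b_{l(i)+j} * 2**j.

    bits   : values in {-1, +1} (assumption of this sketch)
    widths : number of Bits B_i spent on tap i
    step   : quantisation step size q
    """
    taps, offset = [], 0
    for width in widths:
        block = bits[offset:offset + width]
        taps.append(step * sum(b * 2 ** j for j, b in enumerate(block)))
        offset += width          # l(i+1) = l(i) + B_i
    return taps
```

Every tap obtained this way satisfies $|q_i| < q\,\exp_2(B_i)$, which is the additional knowledge mentioned in the example.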



The example shows that one obtains with an appropriate quantisation additional binary unknowns bj. Thus 
one needs additional parameters w^fj^^ that discriminate these Bits. Moreover, again a probability distribu- 
tion is needed for these . Here it is assumed that the distribution given in Q is just extended to these 
parameters. Note that this is equivalent to assuming that code Bits Ci and "channel Bits" bj are independent. 

The code symbol discriminated probabilities remain under the now longer $w$ as in (41). Additionally one obtains discriminated channel symbol probabilities given by

A Gauss approximated discrimination is thus as before; however, one needs to compute new and more general expectations. E.g., for the general linear channel of (40) one needs to compute the expectation given by

EgKlx, = const - i^i#Eg)[|jr - cQ(6)f|x, 
and equivalently for E^'' [uo|x, lo^''^]. 

The expectations are generalised because $Q$ is a map of the random variables $b_j$. With the quantisation
of the example above this map is linear in b. This first gives that cQ{b) can be considered as a quadratic 
function in the binary random variables x and b. The computation of the means is thus akin to the one of 
fourth moments and a known independent channel. 

Overall this gives that the complexity for the computation of the means and variances is for unknown chan- 
nels "only" twice as large as for a known channel (of the same memory). It may, however, still be computed 
with reasonable complexity. Hence, again an iteration based on Gauss approximated discriminators can 
be performed. 

Remark 33 (Miscellaneous) Note that without some known "training" sequence in the code word the iteration will by the symmetry usually stay at $w^{(l)} = 0$. Note, moreover, that this approach is easily extended to time variant channels as considered in [7] or even to more complex, i.e., non linear channel maps. The complexity then remains dominated by the complexity of the computation of the means and correlations.



4. Summary 

In this paper first (dually) coupled codes were discussed. A dually coupled code is given by a juxtaposition of the constituent parity check matrices. Dually coupled codes provide a straightforward albeit prohibitively complex computation of the overall word probabilities $P^{(a)}(s|r)$ by the constituent probabilities $P^{(l)}(s|r)$. However, for these codes a decoding by belief propagation applies.

The then introduced concept of discriminators is summarised by augmenting the probabilities by additional (virtual) parameters $w^{(l)}$ and $u$ to $P(s, u|r, w^{(1)}, w^{(2)})$. This is similar to the procedure used for belief propagation, but there the parameter $u$ is not considered. Such carefully chosen probabilities led (in a globally maximal form) again to optimum decoding decisions of the coupled code. However, the complexity of decoding with globally maximal discriminators remains in the order of a brute force computation of the ML decisions.

It was then shown that local discriminators may perform almost optimally but with much smaller complex- 
ity. This observation then gave rise to the definition of well defined discriminators and therewith again an 
iteration rule. It was then shown that this iteration theoretically admits any element of the typical set of the 
decoding problem as fixed point. 

In the last chapter the central limit theorem then led to a Gauss approximation and a low complexity decoder. Finally, (linear) channel maps with memory were considered. It was shown that under additional approximations equalisation and estimation may be accommodated into the iterative algorithm with only little impact on the complexity.

A. Appendix



A.1. Trellis Based Algorithms 

The trellis is a layered graph representation of the code space E(C) such that every code word $c = (c_1, \ldots, c_n)$ corresponds to a unique path through the trellis from left to right. For a binary code every layer of edges is labelled by one code symbol $c_i \in \mathbb{Z}_2 = \{0, 1\}$. The complexity of the trellis is generally given by the maximum number of edges per layer.

Figure 3: Trellis of the (5,4,2) Code

As example the trellis of a "single parity check" code of length 5 with $H = (11111)$ is depicted in Figure 3. Each of the $2^4$ paths in the trellis defines $c_1$ to $c_5$ of a code word $c$ of even weight.

Here only the basic ideas needed to perform the computations in the trellis are presented. A formal description will be given in another paper [8]. The description here reflects the operations performed in the trellis, i.e., only the lengthening (extending one path) and the junction (combining two incoming paths of one trellis node) are considered.

This is first explained for the VITERBI [5] algorithm that finds the code word with minimal distance. The "lengthening" is given by an addition of the path correlations as depicted in Figure 4 (a). For the combination - the "join" operation - only the path of maximum value is kept. This is equivalent to a minimisation operation for the distances. This is reflected in the name of the algorithm, which is often called min-sum algorithm.

On the other hand, the BCJR [1] algorithm (to compute $P_{C_i}(x|r)$) is often called sum-product algorithm as the lengthening is performed by the product of the path probabilities. The combination of two paths is given by a sum. These operations are summarised in Figure 4 (b).
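
The sum-product operations can be tried on the two state trellis of a single parity check code, where the state is the running parity. The sketch below is an illustration under assumptions: the input format (a list of channel probabilities for the symbol value 1) and both function names are hypothetical, and the brute force routine is only included to check the trellis result on short codes.

```python
from itertools import product

def spc_symbol_posteriors(p1):
    """Forward-backward sum-product pass; returns P(c_i = 1 | r) over even weight words."""
    n = len(p1)
    alpha = [[1.0, 0.0]]                      # paths start at parity 0
    for i in range(n):
        a = alpha[-1]
        # lengthening multiplies by the symbol probability, junction adds both paths
        alpha.append([a[0] * (1 - p1[i]) + a[1] * p1[i],
                      a[1] * (1 - p1[i]) + a[0] * p1[i]])
    beta = [[1.0, 0.0]]                       # paths must end at parity 0
    for i in reversed(range(n)):
        b = beta[0]
        beta.insert(0, [b[0] * (1 - p1[i]) + b[1] * p1[i],
                        b[1] * (1 - p1[i]) + b[0] * p1[i]])
    posteriors = []
    for i in range(n):
        one = sum(alpha[i][s] * p1[i] * beta[i + 1][s ^ 1] for s in (0, 1))
        zero = sum(alpha[i][s] * (1 - p1[i]) * beta[i + 1][s] for s in (0, 1))
        posteriors.append(one / (one + zero))
    return posteriors

def brute_force_posteriors(p1):
    """Same quantity by enumerating all even weight words (exponential effort)."""
    n = len(p1)
    numer, total = [0.0] * n, 0.0
    for c in product((0, 1), repeat=n):
        if sum(c) % 2:
            continue
        weight = 1.0
        for ci, pi in zip(c, p1):
            weight *= pi if ci else 1 - pi
        total += weight
        for i, ci in enumerate(c):
            if ci:
                numer[i] += weight
    return [x / total for x in numer]
```

Replacing the sum in the junction by a maximum (and the product by a sum of logarithms) would turn the forward pass into the min-sum variant described for the VITERBI algorithm.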



Figure 4: Basic Operations in the VITERBI and BCJR Algorithms: (a) VITERBI Algorithm, (b) BCJR Algorithm



Remark 34 (Forward-Backward Algorithm) For the VITERBI algorithm the ML code word is found by following the selected paths (starting from the end node) in backward direction.

The operations of the BCJR algorithm (in forward direction) give at the end directly the probabilities $P_{C_n}(x|r)$. To compute all $P_{C_i}(x|r)$ the BCJR algorithm has to be performed in both directions.

The same holds true for the algorithms below. This is here not considered any further, but keep in mind that only by this two way approach symbol based distributions or moments can be computed with low complexity.

In the following we shall reuse the notation of Figure 4 and use the indexes $s$ and $e$ before respectively after the lengthen or join operation.

A.1.1. Discrete Sets



To compute a hard decision distribution one can just count the number of words of a certain distance to $r$. Let this number be denoted $D(t)$ for weight $t \in \mathbb{Z}$.

This can be done in the trellis by using for the lengthening operation from $D_s(t)$ to $D_e(t)$

$$D_e(t) = \begin{cases} D_s(t-1) & \text{for } c_i \ne r_i \\ D_s(t) & \text{for } c_i = r_i. \end{cases}$$

The junction of paths becomes just

$$D_e(t) = D_s^{(1)}(t) + D_s^{(2)}(t).$$

Given $D(t)$ and a BSC with error probability $p$ one may compute the probability of having words of distance $t$ by

$$P(t|r) \propto D(t) \cdot p^t (1-p)^{n-t}.$$

This can also be done directly in the trellis by

$$P_e(t) = \begin{cases} p \cdot P_s(t-1) & \text{for } c_i \ne r_i \\ (1-p) \cdot P_s(t) & \text{for } c_i = r_i. \end{cases}$$


A.1.2. Moments

For the mean value

$$\mu = E[rc^T] = \sum_{i=1}^{n} E[r_i c_i] \quad \text{holds} \quad E\Big[\sum_{j=1}^{i} c_j r_j \Big| c_i\Big] = E\Big[\sum_{j=1}^{i-1} c_j r_j\Big] + c_i r_i.$$

This directly gives that one obtains for the lengthening

$$P_e = P_s \cdot 2^{r_i c_i} \quad \text{and} \quad \mu_e = \mu_s + r_i c_i.$$

The junction is just the probability weighted sum of the prior computed input means given by

$$P_e = P_s^{(1)} + P_s^{(2)} \quad \text{and} \quad \mu_e = \frac{P_s^{(1)}}{P_e}\mu_s^{(1)} + \frac{P_s^{(2)}}{P_e}\mu_s^{(2)}.$$

Hence, the BCJR algorithm for the probabilities needs to be computed at the same time. Note that the obtained mean values are then readily normalised.

To compute the "energies" $S = E[(\sum_{j=1}^{n} c_j r_j)^2]$ one uses in the same way that

$$E\Big[\Big(\sum_{j=1}^{i} c_j r_j\Big)^2 \Big| c_i\Big] = E\Big[\Big(\sum_{j=1}^{i-1} c_j r_j\Big)^2\Big] + 2 c_i r_i \cdot E\Big[\sum_{j=1}^{i-1} c_j r_j\Big] + (c_i r_i)^2.$$

This additionally gives - to the then necessary computation of means and probabilities - that lengthening and junction are now given by

$$S_e = S_s + 2 r_i c_i \cdot \mu_s + (r_i c_i)^2 \quad \text{and} \quad S_e = \frac{P_s^{(1)}}{P_e}S_s^{(1)} + \frac{P_s^{(2)}}{P_e}S_s^{(2)}.$$

Here again the normalisation is already included. Correlation and higher order moment trellis computations are derived in the same way. However, for an $l$-th moment all $l - 1$ lower moments and the probability need to be additionally computed. Moreover, the description gives that these moment computations may be performed likewise for any linear operation $cQ$ (defined over the field of real or complex numbers) then using vector operations.
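
The lengthening and junction rules for the mean and the energy can be sketched for the single parity check code as follows. Several choices are assumptions of this sketch and not taken from the text: symbols are $c_i \in \{+1, -1\}$ with $-1$ counted as a one for the parity, the symbol weights are given as probabilities `p_minus[i]` strictly between 0 and 1 instead of the factors $2^{r_i c_i}$, and the function name is hypothetical.

```python
def spc_mean_and_energy(r, p_minus):
    """Forward trellis pass for E[u] and E[u^2] with u = sum_i r_i c_i over even parity words."""
    # per parity state: (probability mass P, normalised mean mu, normalised energy S)
    states = {0: (1.0, 0.0, 0.0)}
    for ri, pm in zip(r, p_minus):
        nxt = {}
        for s, (P, mu, S) in states.items():
            for c, pc, flip in ((1.0, 1.0 - pm, 0), (-1.0, pm, 1)):
                # lengthening of one path by the symbol c
                P_e = P * pc
                mu_e = mu + ri * c
                S_e = S + 2.0 * ri * c * mu + (ri * c) ** 2
                t = s ^ flip
                if t in nxt:
                    # junction: probability weighted sum of the joining paths
                    P0, mu0, S0 = nxt[t]
                    P_t = P0 + P_e
                    nxt[t] = (P_t,
                              (P0 * mu0 + P_e * mu_e) / P_t,
                              (P0 * S0 + P_e * S_e) / P_t)
                else:
                    nxt[t] = (P_e, mu_e, S_e)
        states = nxt
    _, mu, S = states[0]                      # words end with even parity
    return mu, S
```

The variance of $u$ then follows as `S - mu ** 2`, which is the quantity needed for a Gauss approximation.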
A.1.3. Continuous Sets



Another possibility to use the trellis is to compute (approximated) histograms for $u = wc^T$ with $w_i \in \mathbb{R}$ and $c_i \in \mathbb{B}$. It is here proposed (other possibilities surely exist) to use - as in the hard decision case above - a vector function $(h(t), \mu)$ with $t \in \mathbb{Z}$ and $|t| < Q$ and the mean value $\mu$, i.e., the values of $u$ with non vanishing probability are assumed to be in a vicinity of the mean value $\mu$ (computed above) or

$$p(u|m) = 0 \quad \text{for } |u - \mu| > Q\varepsilon.$$

Thus {h{t), ii) is defined to be the approximation of 

h{t) « / p{u — ^|Tn)du. 

Here, densities are used to simplify the notation. It is now assumed that the mean values are computed as 
above, which gives that the lengthening is the trivial operation 

The junction, however, cannot be easily performed as usually the mean values do not fit on each other.
Here, it is assumed that the density has for any interval the form of a rectangle. Note that this is again a
maximum entropy assumption.

This gives the approximation of the histogram h\?^ (t) by the junction operation to be 

p(l) p(2) p(l) p(2) 



and 



J e e J e e 



with [zj the integer part, trunc(z) :— z — [zj, 

a{i,P , Mi^) ) + b{t^^ , n^^^ ) = 1 , and b^^^ , A^i^) ) = trunc((/i(^-) - )s). 

A.2. Computation of Lf{m) 

Equation ( fT4l i gives the logarithmic probability ratio 

Lf{m) =n+ ^(m(i)) + Lf\m^'^) + Lf{m). 

The first three terms can be computed as before. For the computation of Lf [m] use that 

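$$P^{\otimes}_{C_i}(x|m) \propto \int_u \frac{p^{(1)}_{C_i}(x,u|m^{(1)})\,p^{(2)}_{C_i}(x,u|m^{(2)})}{p_{C_i}(x,u|m)}\,du = \int_u p^{\otimes}_{C_i}(x,u|m)\,du. \tag{43}$$

To compute (43) a multiplication of multivariate Gauss distributions has to be performed. The moments of the multivariate distributions $p^{(l)}_{C_i}(u|x, m^{(l)})$ and $p_{C_i}(u|x, m)$ are defined by

$$\mu^{(l)}_{i,j}(x) = E_{C^{(l)}}[u_j|c_i = x, m^{(l)}] \quad \text{and} \quad A^{(l)}_{i,j,k}(x) = E_{C^{(l)}}[(u_j - \mu^{(l)}_{i,j})(u_k - \mu^{(l)}_{i,k})|c_i = x, m^{(l)}]$$

and likewise for $\mu_{i,j}(x)$ and $A_{i,j,k}(x)$.
The multivariate Gauss distributions are of the form

$$p_{C_i}(u|x,m) = \frac{1}{\sqrt{|2\pi A_i(x)|}} \exp\!\left(-(u - \mu_i(x))\,[2A_i(x)]^{-1}\,(u - \mu_i(x))^T\right).$$

Set

$$B^{(l)}_i(x) = [A^{(l)}_i(x)]^{-1} \quad \text{and} \quad B_i(x) = [A_i(x)]^{-1}.$$

The operation in (43) then leads to

$$p^{\otimes}_{C_i}(x,u|m) = \frac{\exp\!\left(C_i(x|m) - (u - \mu^{\otimes}_i(x))\,[2A^{\otimes}_i(x)]^{-1}\,(u - \mu^{\otimes}_i(x))^T\right)}{\sqrt{|2\pi A^{\otimes}_i(x)|}}$$

with

$$[A^{\otimes}_i(x)]^{-1} = B^{(1)}_i(x) + B^{(2)}_i(x) - B_i(x)$$

by a comparison of the terms quadratic in $u$,

$$\mu^{\otimes}_i(x)\,[A^{\otimes}_i(x)]^{-1} = \mu^{(1)}_i(x)B^{(1)}_i(x) + \mu^{(2)}_i(x)B^{(2)}_i(x) - \mu_i(x)B_i(x)$$

by a comparison of the terms linear in $u$, and

$$2C_i(x|m) = \mu^{\otimes}_i(x)[A^{\otimes}_i(x)]^{-1}\mu^{\otimes}_i(x)^T + \mu_i(x)B_i(x)\mu_i(x)^T - \mu^{(1)}_i(x)B^{(1)}_i(x)\mu^{(1)}_i(x)^T - \mu^{(2)}_i(x)B^{(2)}_i(x)\mu^{(2)}_i(x)^T - \log\frac{|A^{(1)}_i(x)|\,|A^{(2)}_i(x)|}{|A^{\otimes}_i(x)|\,|A_i(x)|}$$

by a consideration of the remaining constant.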

From the definition of the multivariate distributions then follows that

$$P^{\otimes}_{C_i}(x|m) \propto \int_u p^{\otimes}_{C_i}(x,u|m)\,du = \exp(C_i(x|m)),$$



respectively Lf{m) = -log2(e) • (C, (+l|m) - Q (-l|m)). 
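
The combination of Gauss distributions can be checked numerically in the scalar case. The sketch below is an illustration under assumptions: it treats one dimensional densities only, the function name `combine_gaussians` is hypothetical, and it uses standard Gauss algebra (precisions add and subtract) to return the mean, the variance, and the logarithm of the integral of the product of two Gauss densities divided by a third one.

```python
import math

def combine_gaussians(mu1, a1, mu2, a2, mu0, a0):
    """Scalar case of N(mu1, a1) * N(mu2, a2) / N(mu0, a0).

    Returns (mean, variance, log_integral) of the resulting unnormalised Gauss density.
    """
    b1, b2, b0 = 1.0 / a1, 1.0 / a2, 1.0 / a0
    b = b1 + b2 - b0                          # comparison of the quadratic terms
    if b <= 0.0:
        raise ValueError("combined precision must be positive")
    mu = (b1 * mu1 + b2 * mu2 - b0 * mu0) / b  # comparison of the linear terms
    # remaining constant: quadratic forms of the means and the ratio of variances
    quad = b1 * mu1 ** 2 + b2 * mu2 ** 2 - b0 * mu0 ** 2 - b * mu ** 2
    log_integral = -0.5 * quad + 0.5 * math.log(a0 / (a1 * a2 * b))
    return mu, 1.0 / b, log_integral
```

If the second and the third density coincide the result is the first density again, i.e., the mean and variance are unchanged and the logarithm of the integral vanishes. The difference of `log_integral` for $x = +1$ and $x = -1$ is the quantity that enters the logarithmic probability ratio.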